Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (2018-03-23)

TL;DR

"Global style tokens" (GSTs) are a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. When trained on noisy, unlabeled found data, they learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Abstract

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
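The idea of a token bank that yields soft "labels" can be illustrated with a small sketch. The abstract does not specify the mechanism, so everything below is an assumption made for illustration: the single-head dot-product attention, the bank size `NUM_TOKENS`, the dimension `DIM`, the random initialization, and the helper names `style_embedding` and `embedding_from_weights`. In a trained system the token bank would be learned jointly with the synthesizer and the query would come from a reference audio clip; here both are random placeholders.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embedding_from_weights(weights, token_bank):
    """Weighted sum of the token embeddings; the weights act as soft style labels."""
    dim = len(token_bank[0])
    return [
        sum(w * token[d] for w, token in zip(weights, token_bank))
        for d in range(dim)
    ]


def style_embedding(query, token_bank):
    """Score each token against the query, normalize, and mix the tokens.

    Dot-product scoring is an assumption of this sketch, not a detail
    taken from the abstract.
    """
    scores = [sum(q * t for q, t in zip(query, token)) for token in token_bank]
    weights = softmax(scores)
    return weights, embedding_from_weights(weights, token_bank)


# Hypothetical sizes and random values stand in for learned parameters.
rng = random.Random(0)
NUM_TOKENS, DIM = 4, 8
token_bank = [[rng.gauss(0.0, 0.5) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

# Style transfer: a query (placeholder for a reference-clip summary) picks the mix.
query = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
weights, style = style_embedding(query, token_bank)

# Style control: bypass the query and set the soft labels by hand.
manual_style = embedding_from_weights([0.0, 1.0, 0.0, 0.0], token_bank)
```

The two calls at the end mirror the two uses named in the abstract: deriving the weights from a reference corresponds to style transfer, while setting the weights directly corresponds to controlling synthesis independently of the text.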

Authors

Eric Battenberg

R. Saurous

Yu Zhang
