Tacotron: Towards End-to-End Speech Synthesis (2017-03-29)

TL;DR

Tacotron is an end-to-end generative text-to-speech model that synthesizes speech directly from characters. It achieves a mean opinion score of 3.82 on a subjective 5-point scale for US English, outperforming a production parametric system in naturalness.

Abstract

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
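
The speed claim in the last sentence can be illustrated with a rough count of sequential generation steps. The sketch below is only back-of-the-envelope arithmetic: the sample rate and frame shift are assumed, illustrative values that the abstract does not state, and the helper name `autoregressive_steps` is hypothetical. It compares how many autoregressive steps one second of audio would need when the model emits one waveform sample per step versus one spectrogram frame per step.

```python
def autoregressive_steps(duration_s: float, steps_per_second: float) -> int:
    """Number of sequential generation steps for `duration_s` seconds of audio."""
    return round(duration_s * steps_per_second)


# Assumed, illustrative values (not taken from the abstract above).
SAMPLE_RATE_HZ = 24_000   # hypothetical audio sample rate
FRAME_SHIFT_MS = 12.5     # hypothetical spacing between spectrogram frames

duration_s = 1.0

# Sample-level model: one step per waveform sample.
sample_level_steps = autoregressive_steps(duration_s, SAMPLE_RATE_HZ)

# Frame-level model: one step per spectrogram frame.
frame_level_steps = autoregressive_steps(duration_s, 1000.0 / FRAME_SHIFT_MS)

step_ratio = sample_level_steps / frame_level_steps
print(sample_level_steps, frame_level_steps, step_ratio)
```

Under these assumptions the frame-level model takes far fewer sequential steps per second of audio. Step count is only a proxy for speed: the cost of each step and of converting frames to a waveform is not modeled here.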

Authors

Quoc V. Le

Z. Chen

Samy Bengio
