Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (2017-12-16)

TL;DR

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. It is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model that acts as a vocoder to synthesize time-domain waveforms from those spectrograms.

Abstract

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and $F_{0}$ features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
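As a rough illustration of what "mel-scale" means for the intermediate representation, the sketch below spaces frequency bands evenly on the mel scale using the common HTK-style formula. The function names and the values used here (80 bands, 125 Hz to 7600 Hz) are illustrative assumptions, not details taken from the text above, and the sketch does not reproduce the paper's feature pipeline.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, fmin_hz: float, fmax_hz: float) -> List[float]:
    """Center frequencies (Hz) of n_mels bands equally spaced in mels."""
    lo, hi = hz_to_mel(fmin_hz), hz_to_mel(fmax_hz)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Hypothetical settings chosen only for illustration.
centers = mel_band_centers(n_mels=80, fmin_hz=125.0, fmax_hz=7600.0)
```

Because the bands are equally spaced in mels, they sit close together at low frequencies and spread out at high frequencies, which is what makes the representation compact compared with a linear-frequency spectrogram.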

Authors

Z. Chen

M. Schuster

Ruoming Pang
