We introduce a language modeling approach for text-to-speech (TTS) synthesis. Specifically, we train a neural codec language model (called VALL-E) on discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than existing systems. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech from only a 3-second enrolled recording of an unseen speaker used as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the prompt in synthesis.
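To make the "TTS as conditional language modeling over discrete codec codes" formulation concrete, here is a minimal PyTorch sketch under stated assumptions: a single codebook and a plain decoder-only Transformer, with all names and sizes (CodecLM, n_codes, etc.) hypothetical. The actual VALL-E system models multiple codebooks from a neural audio codec (e.g. EnCodec) with separate autoregressive and non-autoregressive stages; this sketch only illustrates the core idea of predicting acoustic tokens from a phoneme prefix with a standard language-model objective.

```python
# Minimal sketch: text-conditioned autoregressive LM over discrete codec codes.
# Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    def __init__(self, n_phones=100, n_codes=1024, d=256, n_layers=4, max_len=2048):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d)     # text (phoneme) condition
        self.code_emb = nn.Embedding(n_codes + 1, d)   # acoustic codes, +1 for BOS
        self.pos_emb = nn.Embedding(max_len, d)        # learned positions
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, n_codes)

    def forward(self, phones, codes):
        # Shift codes right and prepend BOS: each acoustic token is predicted
        # from the phoneme prefix plus previously generated acoustic tokens.
        bos_id = self.code_emb.num_embeddings - 1
        bos = torch.full((codes.size(0), 1), bos_id,
                         dtype=torch.long, device=codes.device)
        seq = torch.cat([self.phone_emb(phones),
                         self.code_emb(torch.cat([bos, codes[:, :-1]], dim=1))], dim=1)
        seq = seq + self.pos_emb(torch.arange(seq.size(1), device=seq.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        h = self.backbone(seq, mask=mask)         # causal self-attention
        return self.head(h[:, phones.size(1):])   # logits at acoustic positions

# Toy usage: ordinary next-token cross-entropy over codec codes.
model = CodecLM()
phones = torch.randint(0, 100, (2, 12))   # phonemized text
codes = torch.randint(0, 1024, (2, 50))   # first-codebook tokens from a codec
logits = model(phones, codes)             # shape (2, 50, 1024)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), codes.reshape(-1))
```

In this framing, zero-shot personalization follows from ordinary prompting: per the abstract, the 3-second enrollment is encoded into codec tokens and prepended alongside the target phonemes, so continuing the sequence in code space inherits the speaker's voice and acoustic environment.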
Authors: Shujie Liu, Zhuo Chen, Furu Wei, Sheng Zhao, Chengyi Wang, Sanyuan Chen, Ziqiang Zhang, Yanqing Liu, Huaming Wang, Lei He