We introduce a language modeling approach for text-to-speech (TTS) synthesis. Specifically, we train a neural codec language model (called VALL-E) on discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than existing systems. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech from only a 3-second enrolled recording of an unseen speaker used as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the prompt in synthesis.
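To make the "TTS as conditional language modeling over discrete codec codes" formulation concrete, here is a minimal PyTorch sketch under stated assumptions: a single codebook and a plain decoder-only Transformer, with all names and sizes (CodecLM, n_codes, etc.) hypothetical. The actual VALL-E system models multiple codebooks from a neural audio codec (e.g. EnCodec) with separate autoregressive and non-autoregressive stages; this sketch only illustrates the core idea of predicting acoustic tokens from a phoneme prefix with a standard language-model objective.

```python
# Minimal sketch: text-conditioned autoregressive LM over discrete codec codes.
# Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    def __init__(self, n_phones=100, n_codes=1024, d=256, n_layers=4, max_len=2048):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d)     # text (phoneme) condition
        self.code_emb = nn.Embedding(n_codes + 1, d)   # acoustic codes, +1 for BOS
        self.pos_emb = nn.Embedding(max_len, d)        # learned positions
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, n_codes)

    def forward(self, phones, codes):
        # Shift codes right and prepend BOS: each acoustic token is predicted
        # from the phoneme prefix plus previously generated acoustic tokens.
        bos_id = self.code_emb.num_embeddings - 1
        bos = torch.full((codes.size(0), 1), bos_id,
                         dtype=torch.long, device=codes.device)
        seq = torch.cat([self.phone_emb(phones),
                         self.code_emb(torch.cat([bos, codes[:, :-1]], dim=1))], dim=1)
        seq = seq + self.pos_emb(torch.arange(seq.size(1), device=seq.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        h = self.backbone(seq, mask=mask)         # causal self-attention
        return self.head(h[:, phones.size(1):])   # logits at acoustic positions

# Toy usage: ordinary next-token cross-entropy over codec codes.
model = CodecLM()
phones = torch.randint(0, 100, (2, 12))   # phonemized text
codes = torch.randint(0, 1024, (2, 50))   # first-codebook tokens from a codec
logits = model(phones, codes)             # shape (2, 50, 1024)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), codes.reshape(-1))
```

In this framing, zero-shot personalization follows from ordinary prompting: per the abstract, the 3-second enrollment is encoded into codec tokens and prepended alongside the target phonemes, so continuing the sequence in code space inherits the speaker's voice and acoustic environment.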
Authors: Shujie Liu, Zhuo Chen, Furu Wei, Sheng Zhao, Chengyi Wang, Sanyuan Chen, Ziqiang Zhang, Yanqing Liu, Huaming Wang, Lei He