3260 papers • 126 benchmarks • 313 datasets
Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
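As a concrete starting point, here is a minimal inference sketch using the open-source Coqui TTS library (`pip install TTS`); the specific pretrained model name is an assumption, and any of the library's pretrained models would work:

```python
# Minimal text-to-speech inference sketch using the open-source Coqui TTS
# library. The model name below is assumed to be one of the library's
# pretrained English models; substitute any other available model.
from TTS.api import TTS

# Load a pretrained Tacotron 2 model trained on LJSpeech (assumed available).
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize speech from text and write a WAV file to disk.
tts.tts_to_file(
    text="Text-to-speech converts written text into audible speech.",
    file_path="output.wav",
)
```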
These leaderboards are used to track progress in Text-To-Speech Synthesis.
Use these libraries to find Text-To-Speech Synthesis models and implementations.
FastSpeech 2 addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by training the model directly on ground-truth targets instead of the simplified outputs of a teacher model, and by introducing more variation information of speech (such as pitch, energy, and duration) as conditional inputs.
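The variance information is supplied by small per-phoneme predictors; below is a minimal PyTorch sketch of one such variance predictor, with layer sizes chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts one scalar per input position (e.g. pitch, energy, or
    duration). Layer sizes are illustrative assumptions, not the paper's
    exact configuration."""
    def __init__(self, hidden: int = 256, kernel: int = 3, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden) -> one scalar prediction per time step
        h = self.net(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h).squeeze(-1)  # (batch, time)
```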
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters; it achieves a 3.82 subjective mean opinion score on a 5-point scale for US English, outperforming a production parametric system in naturalness.
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNNs), without the use of any recurrent units, to reduce the cost of training.
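To illustrate the recurrence-free idea, here is a toy PyTorch sketch of a text encoder built from dilated 1-D convolutions over character embeddings; all sizes and dilations are assumptions:

```python
import torch
import torch.nn as nn

class ConvTextEncoder(nn.Module):
    """Toy recurrence-free text encoder: character embeddings processed by a
    stack of dilated 1-D convolutions. Sizes and dilations are illustrative."""
    def __init__(self, vocab: int = 64, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d)
            for d in (1, 2, 4, 8)
        )

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        # chars: (batch, length) integer character ids
        x = self.embed(chars).transpose(1, 2)   # (batch, dim, length)
        for conv in self.convs:
            x = torch.relu(conv(x)) + x         # residual connection
        return x.transpose(1, 2)                # (batch, length, dim)
```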
WaveRNN, a single-layer recurrent neural network with a dual softmax layer, matches the quality of the state-of-the-art WaveNet model; a new generation scheme based on subscaling folds a long sequence into a batch of shorter sequences, allowing multiple samples to be generated at once.
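The fold itself can be illustrated with a simple reshape; the sketch below shows only the basic subscale fold, not WaveRNN's full conditioning scheme:

```python
import numpy as np

# Toy illustration of the subscale fold: a length-L sequence is split into
# B interleaved sub-sequences that can be generated as a batch. B = 4 is an
# assumed example value.
B = 4
x = np.arange(16)            # stand-in for a length-16 waveform

folded = x.reshape(-1, B).T  # shape (B, L // B)
print(folded)
# [[ 0  4  8 12]
#  [ 1  5  9 13]
#  [ 2  6 10 14]
#  [ 3  7 11 15]]

# Unfolding recovers the original sequence exactly.
unfolded = folded.T.reshape(-1)
assert np.array_equal(unfolded, x)
```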
The proposed Parallel WaveGAN has only 1.44M parameters and can generate a 24 kHz speech waveform 28.68 times faster than real time on a single GPU, which is comparable to the best distillation-based Parallel WaveNet system.
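For reference, a real-time factor of this kind is typically computed as audio duration divided by wall-clock generation time; the numbers below are placeholders, not measurements:

```python
# How a "28.68x faster than real time" style claim is typically computed:
# audio duration divided by wall-clock generation time. All values here are
# made-up placeholders, not reported measurements.
sample_rate = 24_000        # Hz, matching the Parallel WaveGAN setup
num_samples = 240_000       # 10 seconds of audio
generation_seconds = 0.35   # hypothetical wall-clock time on one GPU

audio_seconds = num_samples / sample_rate
speedup = audio_seconds / generation_seconds
print(f"{speedup:.2f}x faster than real time")  # 28.57x with these numbers
```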
"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voices of novel speakers dissimilar from those seen in training, indicating that the model has learned a high-quality speaker representation.
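A sketch of that sampling step, assuming the embeddings are approximately Gaussian-distributed and L2-normalized (both assumptions, and the synthesizer call is hypothetical):

```python
import numpy as np

# Sketch of sampling a novel-speaker embedding: draw a vector from an assumed
# unit-Gaussian prior and L2-normalize it, then pass it to the synthesizer as
# the speaker condition.
rng = np.random.default_rng(0)
d = 256                                  # assumed embedding dimensionality
z = rng.standard_normal(d)
speaker_embedding = z / np.linalg.norm(z)

# synthesizer.synthesize(text, speaker_embedding)  # hypothetical call
```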
WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality.
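A toy sketch of that trade-off: the refinement loop below starts from noise and denoises for a configurable number of steps. Here `model` is a hypothetical denoiser, and the update rule is simplified relative to the actual diffusion sampler:

```python
import torch

def refine(model, cond, num_steps: int, length: int) -> torch.Tensor:
    """Toy refinement loop in the spirit of WaveGrad: start from Gaussian
    noise and iteratively denoise, conditioned on `cond` (e.g. a
    mel-spectrogram). `model(y, cond, t)` is a hypothetical denoiser that
    returns an update; the real sampler's update rule is more involved."""
    y = torch.randn(1, length)        # start from Gaussian noise
    for t in reversed(range(num_steps)):
        y = y - model(y, cond, t)     # one (simplified) denoising step
    return y

# num_steps is the speed/quality knob: a small count for fast inference,
# a large count for the highest fidelity (exact counts are illustrative).
```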