© 2026 Papersgraph. All rights reserved.

Speech Synthesis

3260 papers • 126 benchmarks • 313 datasets

Speech synthesis is the task of generating speech from another modality, such as text or lip movements. Note that the leaderboards here are not directly comparable between studies, because each study uses mean opinion score (MOS) as its metric and collects ratings on different samples from Amazon Mechanical Turk. (Image credit: WaveNet: A Generative Model for Raw Audio)
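To make the comparability caveat concrete, here is a minimal sketch of how a mean opinion score and an approximate 95% confidence interval might be computed. The ratings, the two "studies", and the function name are hypothetical, and the normal approximation is an assumption rather than a method taken from any listed paper.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean rating, half-width of an approximate 95% confidence interval)."""
    mean = statistics.fmean(ratings)
    if len(ratings) < 2:
        return mean, float("nan")
    # Normal approximation: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 1-5 naturalness ratings gathered from two different listener pools.
study_a = [4, 5, 4, 4, 3, 5, 4, 4]
study_b = [3, 4, 3, 4, 3, 3, 4, 3]

mos_a, ci_a = mean_opinion_score(study_a)
mos_b, ci_b = mean_opinion_score(study_b)
print(f"study A: {mos_a:.2f} +/- {ci_a:.2f}")
print(f"study B: {mos_b:.2f} +/- {ci_b:.2f}")
```

Because the listeners and the rated samples differ between studies, a gap between two such scores may reflect the rating setup rather than the systems themselves.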

Benchmarks

These leaderboards are used to track progress in speech synthesis.

Dataset

LibriTTS

North American English

LJSpeech

Libraries

Use these libraries to find speech synthesis models and implementations.

15 papers 30,917

Datasets

LJSpeech

LibriTTS

THCHS-30

CSS10

PromptSpeech

JSUT Corpus

Subtasks

Expressive Speech Synthesis

Emotional Speech Synthesis

text-to-speech translation

Speech Synthesis - Tamil

Most implemented papers

WaveNet: A Generative Model for Raw Audio

O. Vinyals, K. Simonyan, Nal Kalchbrenner, K. Kavukcuoglu, A. Senior, Alex Graves, S. Dieleman, Aäron van den Oord, H. Zen•Sun Sep 11 2016

WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

8007

Mandarin Chinese

PaddlePaddle/PaddleSpeech

15 papers 10,381

TensorSpeech/TensorflowTTS

6 papers 3,746

keonlee9420/Expressive-FastSpeech2

4 papers 263

dathudeptrai/TensorflowTTS

4 papers 13

CorentinJ/Real-Time-Voice-Cloning

3 papers 51,254

pytorch/fairseq

3 papers 29,576

PaddlePaddle/DeepSpeech

3 papers 10,382

3 papers 1,114

3 papers 1,114

3 papers 274

keonlee9420/STYLER

3 papers 152

keonlee9420/Comprehensive-E2E-TTS

3 papers 141

HaiFengZeng/clari_wavenet_vocoder

3 papers 56

tigthor/Voice-Cloning-AI

3 papers 31

facebookresearch/fairseq

2 papers 29,577

plachtaa/vall-e-x

2 papers 7,361

2 papers 3,784

sh-lee-prml/hierspeechpp

2 papers 1,110

rongjiehuang/transpeech

2 papers 159

playvoice/grad-svc

2 papers 114

jasminsternkopf/mel_cepstral_distan…

2 papers 41

Gumar Corpus

SOMOS

TaL Corpus

HUI speech corpus

Speech Synthesis - Kannada

Speech Synthesis - Malayalam

Speech Synthesis - Telugu

Speech Synthesis - Assamese

Speech Synthesis - Bengali

Speech Synthesis - Bodo

Speech Synthesis - Gujarati

Speech Synthesis - Hindi

Speech Synthesis - Manipuri

Speech Synthesis - Marathi

Speech Synthesis - Rajasthani

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Tao Qin, Tie-Yan Liu, Zhou Zhao, Xu Tan, Yi Ren, Sheng Zhao, Chenxu Hu•Sun Jun 07 2020

FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with ground-truth target instead of the simplified output from teacher, and introducing more variation information of speech as conditional inputs.

1659 0

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Z. Chen, M. Schuster, Ruoming Pang, N. Jaitly, R. Saurous, Ron J. Weiss, Yu Zhang, Yonghui Wu, Yuxuan Wang, R. Skerry-Ryan, Zongheng Yang, Yannis Agiomyrgiannakis, Jonathan Shen•Fri Dec 15 2017

Tacotron 2 is described: a neural network architecture for speech synthesis directly from text, composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

2954 0

Tacotron: Towards End-to-End Speech Synthesis

Quoc V. Le, Z. Chen, Samy Bengio, N. Jaitly, R. Saurous, Ron J. Weiss, R. Clark, Yonghui Wu, Yuxuan Wang, R. Skerry-Ryan, Daisy Stanton, Zongheng Yang, Y. Xiao, Yannis Agiomyrgiannakis•Tue Mar 28 2017

Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.

1985 0

FastTacotron: A Fast, Robust and Controllable Method for Speech Synthesis

D. V. Sang, L. Thu•Thu Sep 30 2021

Recent state-of-the-art neural text-to-speech synthesis models have significantly improved the quality of synthesized speech. However, previous methods still have several problems. While autoregressive models suffer from slow inference speed, non-autoregressive models usually have a complicated training pipeline that consumes considerable time and memory. This paper proposes a novel model called FastTacotron, an improved text-to-speech method based on ForwardTacotron. The proposed model uses the recurrent Tacotron architecture but replaces its autoregressive attentive part with a single forward pass to accelerate inference. The model also replaces the attention mechanism in Tacotron with a length regulator like the one in FastSpeech for parallel mel-spectrogram generation. Moreover, we introduce more prosodic information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs to make the duration predictor more accurate. Experiments show that our model matches state-of-the-art models in terms of speech quality and inference speed, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and makes it possible to control the speed and pitch of the generated utterance. More importantly, our model can converge in just a few hours of training, which is up to 11.2x faster than existing methods. Furthermore, the memory requirement of our model grows linearly with sequence length, which makes it possible to predict complete articles at one time with the model. Audio samples can be found at https://bit.ly/3xguaCW.
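The length regulator mentioned in this abstract can be sketched roughly as follows: each input token's hidden state is repeated for as many output frames as its predicted duration. This is a simplified illustration in plain Python, not the authors' implementation; the function name, the `speed` parameter, and the toy inputs are all hypothetical.

```python
def length_regulate(encoder_states, durations, speed=1.0):
    """Expand per-token states to per-frame states by repeating each one.

    `speed` is a hypothetical control knob: values above 1.0 shorten every
    duration (faster speech), values below 1.0 lengthen it.
    """
    if len(encoder_states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")
    frames = []
    for state, duration in zip(encoder_states, durations):
        repeats = max(0, round(duration / speed))
        frames.extend([state] * repeats)
    return frames


# Toy example: four tokens with predicted durations measured in mel frames.
states = ["h", "e", "l", "o"]
durations = [2, 3, 1, 4]

frames = length_regulate(states, durations)
faster = length_regulate(states, durations, speed=2.0)
print(len(frames), len(faster))
```

Since the expanded sequence has one entry per output frame, all frames can be decoded in parallel, and its size grows linearly with the total duration.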

8 0

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Yoshua Bengio, Aaron C. Courville, Rithesh Kumar, Jose M. R. Sotelo, Kundan Kumar, A. D. Brébisson, T. Boissiere, L. Gestin, Wei Zhen Teoh•Mon Oct 07 2019

The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.

1080 0

Efficient Neural Audio Synthesis

K. Simonyan, Nal Kalchbrenner, K. Kavukcuoglu, S. Dieleman, Erich Elsen, Aäron van den Oord, Edward Lockhart, Seb Noury, Norman Casagrande, Florian Stimberg•Thu Feb 22 2018

WaveRNN, a single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model, is described, along with a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once.
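One way to picture the dual softmax layer is that each audio sample is predicted in two parts. The sketch below assumes 16-bit samples split into a coarse part (high 8 bits) and a fine part (low 8 bits), as described in the WaveRNN paper; this page does not state the bit widths, and the function names are hypothetical.

```python
def split_sample(sample):
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit parts."""
    if not 0 <= sample < 65536:
        raise ValueError("sample must fit in 16 unsigned bits")
    # The high byte is the coarse part, the low byte is the fine part.
    return sample >> 8, sample & 0xFF


def join_sample(coarse, fine):
    """Recombine the coarse and fine parts into the original 16-bit sample."""
    return (coarse << 8) | fine


coarse, fine = split_sample(0xABCD)
print(hex(coarse), hex(fine), hex(join_sample(coarse, fine)))
```

Under this assumption, each softmax only needs 256 output classes instead of a single softmax over 65,536 values.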

915 0

Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim•Thu Oct 24 2019

The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real time on a single GPU, which is comparable to the best distillation-based Parallel WaveNet system.
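As a worked example of what "28.68 times faster than real time" means at a 24 kHz sample rate, the arithmetic can be sketched as below. Only the two constants come from the summary above; the helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate quoted in the summary
SPEEDUP = 28.68          # generation speed relative to real time, from the summary


def samples_per_second(sample_rate=SAMPLE_RATE_HZ, speedup=SPEEDUP):
    """Audio samples generated per wall-clock second."""
    return sample_rate * speedup


def synthesis_time(audio_seconds, speedup=SPEEDUP):
    """Wall-clock seconds needed to generate `audio_seconds` of audio."""
    return audio_seconds / speedup


print(f"{samples_per_second():,.0f} samples per second")
print(f"{synthesis_time(60.0):.2f} s to generate one minute of audio")
```

The same two helpers can be reused for any other vocoder whose speed is reported as a multiple of real time.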

948 0

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Eric Battenberg, R. Saurous, Yu Zhang, Ye Jia, Yuxuan Wang, Joel Shor, R. Skerry-Ryan, Daisy Stanton, Y. Xiao, Fei Ren•Thu Mar 22 2018

"Global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

893 0

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Z. Chen, Ruoming Pang, Ron J. Weiss, Yu Zhang, Ye Jia, Yonghui Wu, Quan Wang, I. López-Moreno, Jonathan Shen, Fei Ren, Patrick Nguyen•Thu May 31 2018

It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

923 0
