3260 papers • 126 benchmarks • 313 datasets
Speech synthesis is the task of generating speech from some other modality like text, lip movements etc. Please note that the leaderboards here are not really comparable between studies - as they use mean opinion score as a metric and collect different samples from Amazon Mechnical Turk. ( Image credit: WaveNet: A generative model for raw audio )
(Image credit: Papersgraph)
These leaderboards are used to track progress in accented-speech-recognition
Use these libraries to find accented-speech-recognition models and implementations
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with ground-truth target instead of the simplified output from teacher, and introducing more variation information of speech as conditional inputs.
Tacotron 2, a neural network architecture for speech synthesis directly from text that is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those Spectrograms is described.
Tacotron is presented, an end-to-end generative text- to-speech model that synthesizes speech directly from characters that achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Recent state-of-the-art neural text-to-speech synthesis models have significantly improved the quality of synthesized speech. However, the previous methods have remained several problems. While autoregressive models suffer from slow inference speed, non-autoregressive models usually have a complicated, time and memory-consuming training pipeline. This paper proposes a novel model called FastTacotron, which is an improved text-to-speech method based on ForwardTacotron. The proposed model uses the recurrent Tacotron architecture but replacing its autoregressive attentive part with a single forward pass to accelerate the inference speed. The model also replaces the attention mechanism in Tacotron with a length regulator like the one in FastSpeech for parallel mel-spectrogram generation. Moreover, we introduce more prosodic information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs to make the duration predictor more accurate. Experiments show that our model matches state-of-the-art models in terms of speech quality and inference speed, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and possible to control the speed and pitch of the generated utterance. More importantly, our model can converge just in few hours of training, which is up to 11.2x times faster than existing methods. Furthermore, the memory requirement of our model grows linearly with sequence length, which makes it possible to predict complete articles at one time with the model. Audio samples can be found in https://bit.ly/3xguaCW.
The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion, and suggests a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks.
A single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model, the WaveRNN, and a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once.
The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment, which is comparative to the best distillation-based Parallel WaveNet system.
"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Adding a benchmark result helps the community track progress.