Voice cloning is a highly desired feature for personalized speech interfaces. A neural voice cloning system learns to synthesize a person's voice from only a few audio samples.
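A minimal, illustrative sketch of the speaker-encoding route to voice cloning: a speaker encoder condenses a few reference clips into one embedding, and a TTS model is conditioned on it. All module names, shapes, and the toy PyTorch code below are assumptions for illustration, not any specific published system.

```python
# Sketch of cloning a voice from a few audio samples via a speaker embedding.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Averages frame-level features of reference audio into one embedding."""

    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, ref_mels: torch.Tensor) -> torch.Tensor:
        # ref_mels: (n_clips, frames, n_mels) -> (emb_dim,)
        _, h = self.rnn(ref_mels)
        emb = h[-1].mean(dim=0)              # average over the few reference clips
        return emb / emb.norm()              # unit-norm speaker embedding


class ConditionedTTS(nn.Module):
    """Toy text encoder whose output is biased by the speaker embedding."""

    def __init__(self, vocab: int = 100, emb_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, emb_dim)
        self.decoder = nn.Linear(emb_dim, n_mels)

    def forward(self, text_ids: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        hidden = self.text_emb(text_ids) + spk_emb   # broadcast speaker identity
        return self.decoder(hidden)                  # (seq_len, n_mels) mel frames


# Cloning from a few samples: three short reference clips, one sentence.
encoder, tts = SpeakerEncoder(), ConditionedTTS()
spk = encoder(torch.randn(3, 120, 80))               # 3 clips, 120 frames each
mel = tts(torch.randint(0, 100, (25,)), spk)         # 25-token "sentence"
print(mel.shape)                                     # torch.Size([25, 80])
```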
These leaderboards are used to track progress in Voice Cloning.
No benchmarks available.
Use these libraries to find Voice Cloning models and implementations.
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
While speaker adaptation can achieve better naturalness and similarity, the cloning time or required memory for the speaker encoding approach is significantly less, making it favorable for low-resource deployment.
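A rough sketch of the trade-off described in the two summaries above, with toy PyTorch stand-ins: speaker encoding is a single forward pass, speaker adaptation runs a few gradient steps on the reference audio, and a randomly sampled unit-norm embedding can act as a novel "virtual" speaker. Everything here is illustrative, not the paper's models.

```python
import torch
import torch.nn as nn

emb_dim, n_mels = 256, 80
synthesizer = nn.Linear(emb_dim, n_mels)          # stand-in for a full TTS decoder
speaker_encoder = nn.Linear(n_mels, emb_dim)      # stand-in for a trained encoder
ref_mels = torch.randn(200, n_mels)               # a few seconds of reference audio

# 1) Speaker encoding: one cheap forward pass, tiny memory footprint.
with torch.no_grad():
    spk_encoded = speaker_encoder(ref_mels).mean(dim=0)

# 2) Speaker adaptation: a few gradient steps on the reference audio;
#    slower and heavier, but often better naturalness and similarity.
spk_adapted = torch.zeros(emb_dim, requires_grad=True)
opt = torch.optim.Adam([spk_adapted], lr=1e-2)
for _ in range(50):
    recon = synthesizer(spk_adapted).expand_as(ref_mels)
    loss = nn.functional.mse_loss(recon, ref_mels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3) A random unit-norm embedding acts as a novel speaker.
spk_random = torch.randn(emb_dim)
spk_random = spk_random / spk_random.norm()

for name, spk in [("encoded", spk_encoded), ("adapted", spk_adapted), ("random", spk_random)]:
    print(name, synthesizer(spk).shape)           # each conditions the same synthesizer
```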
A multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages and to transfer voices across languages, e.g. English and Mandarin.
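The conditioning such a model implies can be sketched as follows; the dimensions, module names, and fusion-by-concatenation choice are assumptions for illustration, not the paper's exact architecture.

```python
# A speaker seen only in English can be paired with the Mandarin language ID
# to set up cross-lingual voice transfer.
import torch
import torch.nn as nn


class MultilingualTTSEncoder(nn.Module):
    def __init__(self, vocab=200, n_speakers=10, n_langs=2, dim=128):
        super().__init__()
        self.text = nn.Embedding(vocab, dim)
        self.speaker = nn.Embedding(n_speakers, dim)
        self.language = nn.Embedding(n_langs, dim)
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, tokens, speaker_id, lang_id):
        t = self.text(tokens)                                   # (seq, dim)
        s = self.speaker(speaker_id).expand(t.size(0), -1)      # broadcast per position
        l = self.language(lang_id).expand(t.size(0), -1)
        return self.proj(torch.cat([t, s, l], dim=-1))          # decoder input


enc = MultilingualTTSEncoder()
english_speaker = torch.tensor(3)
mandarin = torch.tensor(1)
tokens = torch.randint(0, 200, (30,))         # phonemes of a Mandarin sentence
out = enc(tokens, english_speaker, mandarin)  # voice transferred across languages
print(out.shape)                              # torch.Size([30, 128])
```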
A speech-text joint pretraining framework in which, given a speech example and its transcription, the spectrogram and the phonemes are masked and the model is trained to reconstruct the masked parts of the input in different languages; it shows great improvements over speaker-embedding-based multi-speaker TTS methods.
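A hedged sketch of that masked speech-text objective, using placeholder masking ratios and a toy shared encoder rather than the paper's model:

```python
import torch
import torch.nn as nn

n_mels, seq_len, vocab = 80, 100, 60
spec = torch.randn(seq_len, n_mels)              # speech side
phonemes = torch.randint(1, vocab, (seq_len,))   # text side (0 = [MASK])

# Mask 15% of spectrogram frames and 15% of phoneme tokens (illustrative ratio).
spec_mask = torch.zeros(seq_len, dtype=torch.bool)
spec_mask[torch.randperm(seq_len)[:15]] = True
phon_mask = torch.zeros(seq_len, dtype=torch.bool)
phon_mask[torch.randperm(seq_len)[:15]] = True
spec_in = spec.clone()
spec_in[spec_mask] = 0.0
phon_in = phonemes.clone()
phon_in[phon_mask] = 0

# A shared encoder sees both modalities; two heads reconstruct the masked parts.
dim = 128
encoder = nn.Linear(n_mels + dim, dim)
phon_emb = nn.Embedding(vocab, dim)
spec_head = nn.Linear(dim, n_mels)
phon_head = nn.Linear(dim, vocab)

hidden = encoder(torch.cat([spec_in, phon_emb(phon_in)], dim=-1))
spec_loss = nn.functional.mse_loss(spec_head(hidden)[spec_mask], spec[spec_mask])
phon_loss = nn.functional.cross_entropy(phon_head(hidden)[phon_mask], phonemes[phon_mask])
print(float(spec_loss + phon_loss))              # joint reconstruction objective
```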
An approach to multilingual speech synthesis is introduced that uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches.
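Contextual parameter generation can be sketched roughly as a small network that maps a language embedding to the weights of a language-specific layer, so languages share knowledge through the generator; the sizes and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, n_langs = 64, 5
lang_emb = nn.Embedding(n_langs, 16)
# The generator emits a (dim x dim) weight matrix plus a bias per language.
param_gen = nn.Linear(16, dim * dim + dim)


def language_specific_layer(x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
    params = param_gen(lang_emb(lang_id))             # parameters come from the generator
    w = params[: dim * dim].view(dim, dim)
    b = params[dim * dim:]
    return torch.tanh(x @ w.T + b)                    # generated linear layer


x = torch.randn(30, dim)                              # encoded phonemes of one utterance
for lang in range(n_langs):
    y = language_specific_layer(x, torch.tensor(lang))
    print(lang, y.shape)                              # same layer, per-language weights
```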
This work investigates different speaker representations and proposes to integrate pretrained and learnable speaker representations, finding that the embedding pretrained by voice conversion achieves the best performance.
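One possible way to combine the two kinds of speaker representation, sketched with assumed dimensions and a simple concatenate-then-project fusion (not necessarily the paper's scheme):

```python
import torch
import torch.nn as nn


class HybridSpeakerRepresentation(nn.Module):
    def __init__(self, n_speakers: int, pretrained: torch.Tensor, learn_dim: int = 64):
        super().__init__()
        # Frozen table of embeddings extracted offline by a pretrained model
        # (e.g. a voice conversion system).
        self.register_buffer("pretrained", pretrained)
        self.learnable = nn.Embedding(n_speakers, learn_dim)
        self.fuse = nn.Linear(pretrained.size(1) + learn_dim, 128)

    def forward(self, speaker_id: torch.Tensor) -> torch.Tensor:
        fixed = self.pretrained[speaker_id]            # no gradients flow into this table
        adaptive = self.learnable(speaker_id)          # trained jointly with the TTS loss
        return self.fuse(torch.cat([fixed, adaptive], dim=-1))


pretrained_table = torch.randn(10, 256)                # 10 speakers, VC-style embeddings
rep = HybridSpeakerRepresentation(10, pretrained_table)
print(rep(torch.tensor([2])).shape)                    # torch.Size([1, 128])
```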
A novel video compression pipeline, called Txt2Vid, is presented, which dramatically reduces data transmission rates by compressing webcam videos to a text transcript, and achieves two to three orders of magnitude reduction in the bitrate as compared to the standard audio-video codecs (encoders-decoders).
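A back-of-envelope check of that bitrate claim, using generic assumptions (speaking rate, characters per word, webcam bitrate) rather than the paper's measurements:

```python
# A text transcript of speech needs only on the order of a hundred bits per
# second, while a typical webcam video stream needs hundreds of kilobits.
words_per_minute = 150           # typical speaking rate
chars_per_word = 6               # including the trailing space
bits_per_char = 8                # plain ASCII, before any text compression
text_bps = words_per_minute / 60 * chars_per_word * bits_per_char

video_bps = 500_000              # a common webcam video+audio bitrate (~500 kbps)

print(f"transcript:  {text_bps:.0f} bps")
print(f"codec video: {video_bps} bps")
print(f"reduction:   ~{video_bps / text_bps:.0f}x")   # thousands-fold, i.e. ~3 orders
```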
A parallel non-autoregressive network is described that achieves bilingual and code-switched voice conversion for multiple speakers when only monolingual corpora are available for each language.
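In the simplest terms, "parallel, non-autoregressive" conversion means all output frames are produced in one pass, conditioned on content features plus a target-speaker label and per-frame language labels (which is what permits code-switched output). The shapes and module names below are assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

n_mels, dim, n_speakers, n_langs = 80, 128, 4, 2
content_proj = nn.Linear(n_mels, dim)
speaker_emb = nn.Embedding(n_speakers, dim)
lang_emb = nn.Embedding(n_langs, dim)
decoder = nn.Linear(dim, n_mels)

frames = 200
source_mels = torch.randn(frames, n_mels)
# Code-switched utterance: first half labelled language 0, second half language 1.
frame_langs = torch.cat([torch.zeros(100, dtype=torch.long), torch.ones(100, dtype=torch.long)])
target_speaker = torch.tensor(3)

hidden = content_proj(source_mels) + speaker_emb(target_speaker) + lang_emb(frame_langs)
converted = decoder(hidden)          # all 200 frames produced in parallel, no recurrence
print(converted.shape)               # torch.Size([200, 80])
```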
This work considers data given as an invertible mixture of two statistically independent components and assumes that one of the components is observed while the other is hidden, and proposes an autoencoder equipped with a discriminator to recover the hidden component.
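A minimal sketch of that setup, assuming for simplicity that the "invertible mixture" is a concatenation and that the adversarial term is a simple regression discriminator; the real method and losses may differ.

```python
import torch
import torch.nn as nn

d_obs, d_hid, d_lat = 8, 8, 8
encoder = nn.Linear(d_obs + d_hid, d_lat)        # sees the mixed data
decoder = nn.Linear(d_obs + d_lat, d_obs + d_hid)
discriminator = nn.Linear(d_lat, d_obs)          # tries to read the observed part

observed = torch.randn(32, d_obs)
hidden_c = torch.randn(32, d_hid)
x = torch.cat([observed, hidden_c], dim=-1)      # "mixture" (here simply a concat)

latent = encoder(x)
recon = decoder(torch.cat([observed, latent], dim=-1))
recon_loss = nn.functional.mse_loss(recon, x)
# Adversarial term: the encoder wants the discriminator to fail at predicting
# the observed component, pushing the latent toward the hidden component only.
adv_loss = -nn.functional.mse_loss(discriminator(latent), observed)
total = recon_loss + 0.1 * adv_loss
print(float(recon_loss), float(adv_loss), float(total))
```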