3260 papers • 126 benchmarks • 313 datasets
Voice Conversion is a technology that modifies the speech of a source speaker so that it sounds like the speech of a target speaker, without changing the linguistic information. Source: Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
These leaderboards are used to track progress in Voice Conversion.
Use these libraries to find Voice Conversion models and implementations.
This paper proposed a novel one-shot VC approach which can perform conversion given only one example utterance each from the source and target speakers, neither of whom needs to have been seen during training.
This work uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss to learn a mapping from source to target speech without relying on parallel data.
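The two losses named above can be illustrated numerically. This is a minimal sketch, not the paper's implementation: the toy linear "generators" stand in for the gated-CNN generators, and the discriminator/adversarial term is omitted.

```python
import numpy as np

# Hypothetical toy "generators"; in CycleGAN-VC these are gated CNNs
# mapping source spectral features to target features and back.
def G_src_to_tgt(x):
    return 1.1 * x + 0.2     # placeholder forward mapping

def G_tgt_to_src(y):
    return (y - 0.2) / 1.1   # placeholder inverse mapping

def l1(a, b):
    return float(np.mean(np.abs(a - b)))

x = np.linspace(-1.0, 1.0, 8)   # stand-in source features
y = G_src_to_tgt(x)             # stand-in target-domain features

# Cycle-consistency loss: source -> target -> source should reconstruct x,
# which is what lets training proceed without parallel data.
cycle_loss = l1(G_tgt_to_src(G_src_to_tgt(x)), x)

# Identity-mapping loss: feeding target-domain features through the
# source-to-target generator should leave them unchanged, which helps
# preserve linguistic content.
identity_loss = l1(G_src_to_tgt(y), y)
```

For this invertible toy mapping the cycle loss is zero while the identity loss is not, showing that the two terms constrain the generators in different ways.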
CycleGAN-VC2 is proposed, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), improved generator (2-1-2D CNN), and improved discriminator (PatchGAN).
Results confirm that the proposed deep learning-based assessment models could be used as a computational evaluator to measure the MOS of VC systems to reduce the need for expensive human rating.
SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch and rhythm without text labels and can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks.
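The "information bottleneck" idea behind SpeechSplit can be sketched with a toy example. This is illustrative only, assuming a simple temporal-downsampling bottleneck rather than the paper's actual encoder architecture: constraining a representation's capacity forces it to discard fast-varying information while slow-varying structure survives.

```python
import numpy as np

def bottleneck(features, keep_every):
    """Toy bottleneck: downsample along time, then hold each kept value
    to restore the original length. Capacity drops by `keep_every`."""
    coarse = features[::keep_every]
    return np.repeat(coarse, keep_every)[: len(features)]

t = np.arange(32)
slow = np.sin(0.2 * t)   # slow-varying component (e.g. prosodic contour)
fast = np.sin(2.5 * t)   # fast-varying component (fine detail)

# The bottleneck reconstructs the slow signal far better than the fast one:
# whatever cannot squeeze through is lost, which is the mechanism used to
# force each encoder to carry only one factor of the speech signal.
err_slow = float(np.mean(np.abs(bottleneck(slow, 4) - slow)))
err_fast = float(np.mean(np.abs(bottleneck(fast, 4) - fast)))
```

Comparing `err_slow` and `err_fast` shows the asymmetry: the same bottleneck is nearly lossless for the slow component and destructive for the fast one.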
This paper uses self-supervised pre-trained models for MOS prediction, shows that their representations can distinguish between clean and noisy audio, and outperforms the two previous state-of-the-art models by a significant margin on Voice Conversion Challenge 2018.
Experimental results on the ASVspoof 2019 dataset demonstrate that high-level representations extracted by Mockingjay can prevent the transferability of adversarial examples, and successfully counter black-box attacks.
A spectral conversion (SC) framework based on a variational auto-encoder is proposed, which enables the exploitation of non-parallel corpora and removes the requirement of parallel corpora or phonetic alignments for training a spectral conversion system.
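The VAE objective such a framework optimizes can be sketched numerically. This is a toy sketch, not the paper's model: the encoder and decoder are replaced by linear maps, and the speaker code is a scalar; the point is the ELBO structure (reconstruction term plus closed-form KL) and the reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame):
    # Toy "encoder": returns the mean and log-variance of q(z | x).
    mu = 0.5 * frame
    log_var = np.zeros_like(frame)
    return mu, log_var

def decode(z, speaker_code):
    # Toy speaker-conditioned "decoder": conditioning on a different
    # speaker code at inference time is what performs the conversion.
    return 2.0 * z + speaker_code

frame = rng.normal(size=16)   # stand-in spectral frame
speaker_code = 0.0            # source speaker's code during training

mu, log_var = encode(frame)
eps = rng.normal(size=16)
z = mu + np.exp(0.5 * log_var) * eps   # reparameterization trick

recon = decode(z, speaker_code)
recon_loss = float(np.mean((recon - frame) ** 2))

# KL( q(z|x) || N(0, I) ) in closed form for diagonal Gaussians.
kl = float(-0.5 * np.mean(1.0 + log_var - mu**2 - np.exp(log_var)))

elbo_loss = recon_loss + kl
```

Because the latent prior does not depend on the speaker, the same latent can be decoded with any speaker code, which is why no parallel corpora or alignments are needed.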