Given a silent video of a speaker, generate the corresponding speech that matches the lip movements.
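To make the task concrete, below is a minimal sketch (assumed PyTorch; module and parameter names are hypothetical, not from any cited paper) of the typical lip-to-speech pipeline: a visual encoder maps silent lip-region frames to a mel-spectrogram, which a separately trained neural vocoder would then convert to a waveform.

```python
import torch
import torch.nn as nn

class LipToSpeech(nn.Module):
    def __init__(self, frame_dim=512, mel_bins=80):
        super().__init__()
        # 3D conv front-end over (time, height, width) lip crops
        self.visual_encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space away
        )
        self.temporal = nn.GRU(64, frame_dim, batch_first=True, bidirectional=True)
        self.mel_head = nn.Linear(2 * frame_dim, mel_bins)

    def forward(self, frames):
        # frames: (batch, 3, T, H, W) silent lip-region video
        feats = self.visual_encoder(frames)                     # (batch, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)   # (batch, T, 64)
        feats, _ = self.temporal(feats)                         # (batch, T, 2*frame_dim)
        return self.mel_head(feats)                             # predicted mel-spectrogram

# Usage: mel = LipToSpeech()(torch.randn(2, 3, 75, 96, 96))
# A vocoder (e.g. HiFi-GAN) would then turn `mel` into audible speech.
```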
These leaderboards are used to track progress in Visual Speech Recognition.
Use these libraries to find Visual Speech Recognition models and implementations.
This work proposes a novel approach with key design choices to achieve, for the first time, accurate and natural lip-to-speech synthesis in unconstrained scenarios, and shows that its method is four times more intelligible than previous works in this space.
A novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), is introduced, which jointly models local and global lip movements during speech synthesis; synchronization learning is applied as a form of contrastive learning that guides the generator to synthesize speech in sync with the given input lip movements.
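The synchronization objective mentioned above can be illustrated with a minimal sketch (assumptions: PyTorch, a symmetric cosine-similarity InfoNCE loss; the actual VCA-GAN formulation may differ): generated-speech features should match the visual features of the same time step, with other time steps acting as negatives.

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(visual_feats, speech_feats, temperature=0.1):
    # visual_feats, speech_feats: (T, D) per-time-step embeddings of one clip
    v = F.normalize(visual_feats, dim=-1)
    s = F.normalize(speech_feats, dim=-1)
    logits = v @ s.t() / temperature            # (T, T) similarity matrix
    targets = torch.arange(v.size(0))           # positives lie on the diagonal
    # symmetric InfoNCE: align visual->speech and speech->visual
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```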
This paper proposes to use quantized self-supervised speech representations, named speech units, as an additional prediction target for the lip-to-speech (L2S) model, and introduces a multi-input vocoder that can generate a clear waveform even from a blurry and noisy mel-spectrogram by referring to the speech units.
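A minimal sketch of that multi-task objective follows (assumed PyTorch; layer names and loss weighting are hypothetical): the model predicts both a mel-spectrogram and discrete self-supervised speech units (e.g. clustered codes from a self-supervised speech encoder) for the same frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelAndUnitHeads(nn.Module):
    def __init__(self, feat_dim=512, mel_bins=80, num_units=200):
        super().__init__()
        self.mel_head = nn.Linear(feat_dim, mel_bins)    # continuous mel target
        self.unit_head = nn.Linear(feat_dim, num_units)  # discrete speech-unit target

    def forward(self, feats):
        # feats: (batch, T, feat_dim) from the visual encoder
        return self.mel_head(feats), self.unit_head(feats)

def l2s_loss(pred_mel, pred_unit_logits, target_mel, target_units, unit_weight=1.0):
    mel_loss = F.l1_loss(pred_mel, target_mel)
    # cross_entropy expects (batch, classes, T) logits and (batch, T) integer targets
    unit_loss = F.cross_entropy(pred_unit_logits.transpose(1, 2), target_units)
    return mel_loss + unit_weight * unit_loss
```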
This work presents a novel method, "Lip2Speech", in which the speaker's voice identity is captured through their facial characteristics (e.g., age, gender, ethnicity), which are conditioned along with the lip movements to generate speaker-identity-aware speech.
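One common way to realize such conditioning is sketched below (assumed PyTorch; embedding sizes and module names are hypothetical, not the paper's exact architecture): a face-derived speaker embedding is broadcast over time and concatenated with the lip features before decoding, so the synthesized voice reflects attributes inferred from the speaker's face.

```python
import torch
import torch.nn as nn

class SpeakerConditionedDecoder(nn.Module):
    def __init__(self, lip_dim=512, face_dim=256, mel_bins=80):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, lip_dim)
        self.decoder = nn.GRU(2 * lip_dim, lip_dim, batch_first=True)
        self.mel_head = nn.Linear(lip_dim, mel_bins)

    def forward(self, lip_feats, face_embedding):
        # lip_feats: (batch, T, lip_dim); face_embedding: (batch, face_dim)
        spk = self.face_proj(face_embedding).unsqueeze(1)   # (batch, 1, lip_dim)
        spk = spk.expand(-1, lip_feats.size(1), -1)         # broadcast over time
        x = torch.cat([lip_feats, spk], dim=-1)             # condition every step
        out, _ = self.decoder(x)
        return self.mel_head(out)                           # (batch, T, mel_bins)
```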
A powerful Lip2Speech method that can reconstruct speech with correct content from the input lip movements, even in in-the-wild environments, is developed and verified on the LRS2, LRS3, and LRW datasets.
Adding a benchmark result helps the community track progress.