3260 papers • 126 benchmarks • 313 datasets
Talking face generation aims to synthesize a sequence of face images that correspond to given speech semantics. (Image credit: Talking Face Generation by Adversarially Disentangled Audio-Visual Representation)
These leaderboards are used to track progress in Talking Face Generation.
Use these libraries to find Talking Face Generation models and implementations.
This work investigates the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment, identifies key reasons why existing approaches fail at this, and resolves them by learning from a powerful lip-sync discriminator.
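The lip-sync discriminator mentioned above typically scores how well an audio window matches the lip region of a frame. A minimal sketch of that training signal, assuming hypothetical precomputed audio and lip embeddings (the real discriminator and its embedding networks are learned):

```python
import numpy as np

def sync_confidence(audio_emb, video_emb, eps=1e-8):
    # Cosine similarity between an audio-window embedding and a
    # lip-region embedding; a lip-sync discriminator scores
    # in-sync vs. out-of-sync pairs with this kind of measure.
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    v = video_emb / (np.linalg.norm(video_emb) + eps)
    return float(a @ v)

def sync_loss(audio_embs, video_embs):
    # Generator penalty: push each frame's sync confidence toward 1
    # (a cross-entropy-style form; illustrative, not the exact loss).
    losses = []
    for a, v in zip(audio_embs, video_embs):
        p = (sync_confidence(a, v) + 1.0) / 2.0  # map [-1, 1] -> [0, 1]
        losses.append(-np.log(np.clip(p, 1e-7, 1.0)))
    return float(np.mean(losses))
```

Minimizing this loss while the discriminator stays frozen is what drives the generated mouth shapes toward the target speech.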
A method that generates expressive talking heads from a single facial image, with audio as the only input. It synthesizes photorealistic videos of entire talking heads with a full range of motion, and can also animate artistic paintings, sketches, 2D cartoon characters, Japanese manga, and stylized caricatures in a single unified framework.
A novel conditional video generation network is proposed in which the audio input conditions a recurrent adversarial network, incorporating temporal dependency to realize smooth transitions in lip and facial movement.
This work finds that a talking face sequence is actually a composition of subject-related information and speech-related information, and learns a disentangled audio-visual representation with the advantage that either audio or video can serve as input for generation.
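The disentanglement described above means identity and speech content live in separate codes, and the speech code can come from either modality. A toy sketch with random linear maps standing in for the trained encoders and decoder (all weights here are illustrative placeholders, not the learned model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linear "encoders" and "decoder"; the real networks are learned
# adversarially so that the two codes stay disentangled.
W_id, W_speech_v, W_speech_a = rng.normal(size=(3, 8, 4))
W_dec = rng.normal(size=(8, 16))

def encode_identity(face):       # who is speaking
    return face @ W_id

def encode_speech_video(face):   # what is said, read from the lips
    return face @ W_speech_v

def encode_speech_audio(audio):  # what is said, read from the sound
    return audio @ W_speech_a

def generate(identity_code, speech_code):
    # The decoder composes the two disentangled factors into a frame code.
    return np.concatenate([identity_code, speech_code]) @ W_dec
```

Because `generate` only sees a speech code, swapping `encode_speech_video` for `encode_speech_audio` is exactly how the same model handles both video-driven and audio-driven generation.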
The proposed method, known as ReenactGAN, transfers facial movements and expressions from an arbitrary person's monocular video input to a target person's video, and can perform photo-realistic face reenactment.
A unique 4D face dataset is introduced, with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers, and VOCA (Voice Operated Character Animation) is learned: the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting.
This work presents Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis that generalizes across different people. It can synthesize videos of a target actor with the voice of any unknown source actor, or even with synthetic voices generated using standard text-to-speech approaches.
An end-to-end talking face generation system is designed that takes a speech utterance, a single face image, and a categorical emotion label as input, and renders a talking face video synchronized with the speech and expressing the conditioned emotion.
This work presents an unsupervised stochastic audio-to-video generation model that can capture multiple modes of the video distribution and does so through a principled multi-modal variational autoencoder framework.
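The "multiple modes" in the summary above come from sampling the latent variable of a variational autoencoder: the same conditioning can decode to different plausible videos. A minimal sketch of that sampling step, with a hypothetical linear decoder in place of the real video decoder:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
    # which keeps sampling differentiable with respect to mu and log_var.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z, W):
    # Hypothetical linear decoder standing in for the video frame decoder.
    return np.tanh(z @ W)

rng = np.random.default_rng(42)
W = rng.normal(size=(4, 6))
mu, log_var = np.zeros(4), np.zeros(4)

# Two draws from the same posterior decode to two different frames:
# this stochasticity is what lets the model capture multiple modes.
frame_a = decode(reparameterize(mu, log_var, rng), W)
frame_b = decode(reparameterize(mu, log_var, rng), W)
```

A deterministic audio-to-video mapping would collapse all of this variability into a single average output; sampling `z` is what preserves the distribution over videos.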
Experimental results demonstrate that the novel framework can produce high-fidelity and natural results, and support free adjustment of audio signals, viewing directions, and background images.