Given a video of an arbitrary person and an arbitrary driving speech clip, the task is to generate a lip-synced video in which the person's lip movements match the given speech. The approach must not be constrained by the speaker's identity, voice, or language.
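The input/output contract is simple even though the modeling is hard. Below is a minimal, runnable sketch of that contract on dummy tensors; the function name, all shapes, and the pass-through body are illustrative assumptions, not any particular model's specification.

```python
# Sketch of the task's I/O contract on dummy data (shapes are assumptions):
# a sequence of face frames plus a mel-spectrogram of the driving speech in,
# a lip-synced frame sequence of the same shape out.
import torch

def lip_sync(frames: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
    """Hypothetical entry point: a real model conditions the output frames
    on both identity (frames) and speech content (mel)."""
    assert frames.ndim == 4   # (T, 3, H, W): T RGB video frames
    assert mel.ndim == 2      # (T_mel, 80): mel-spectrogram of the speech
    return frames             # identity pass-through stands in for a model

frames = torch.rand(75, 3, 96, 96)  # ~3 s of video at 25 fps, 96x96 face crops
mel = torch.rand(240, 80)           # ~3 s of audio as an 80-bin mel-spectrogram
print(lip_sync(frames, mel).shape)  # torch.Size([75, 3, 96, 96])
```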
This work builds a working speech-to-speech translation system by bringing together multiple existing modules from speech and language processing, and incorporates a novel visual module, LipGAN, which generates realistic talking faces from the translated audio.
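A hedged sketch of how such a pipeline composes is below; every stage is a placeholder stub (the function names `asr`, `translate`, `tts`, and `lipgan` are assumptions, not the paper's APIs), kept trivial so the script runs end to end.

```python
# Hedged sketch of a face-to-face translation pipeline: speech recognition,
# translation, and synthesis feed a LipGAN-style visual module that
# re-renders the face. All stages are stubs on dummy data.

def asr(speech_wav: str) -> str:
    return "hello world"                   # stub: automatic speech recognition

def translate(text: str, tgt_lang: str) -> str:
    return f"[{tgt_lang}] {text}"          # stub: machine translation

def tts(text: str) -> bytes:
    return text.encode()                   # stub: text-to-speech synthesis

def lipgan(face_video: str, audio: bytes) -> str:
    return face_video.replace(".mp4", ".dubbed.mp4")  # stub: talking-face generation

def face_to_face_translation(face_video: str, speech_wav: str, tgt_lang: str) -> str:
    text = asr(speech_wav)                   # 1. transcribe the source speech
    translated = translate(text, tgt_lang)   # 2. translate the transcript
    audio = tts(translated)                  # 3. synthesise target-language speech
    return lipgan(face_video, audio)         # 4. lip-sync the face to the new audio

print(face_to_face_translation("talk.mp4", "talk.wav", "hi"))  # talk.dubbed.mp4
```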
An encoder–decoder convolutional neural network is developed that uses a joint embedding of the face and audio to generate synthesised talking-face video frames; methods are also proposed to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.
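To make the joint-embedding idea concrete, here is a small PyTorch sketch, not the paper's architecture: a face encoder and an audio encoder produce embeddings that are concatenated and decoded into a frame patch. All layer sizes and the output resolution are illustrative assumptions.

```python
# Toy encoder-decoder with a joint face+audio embedding (sizes are arbitrary).
import torch
import torch.nn as nn

class JointEmbedGenerator(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.face_enc = nn.Sequential(            # encodes a 96x96 face crop
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
        self.audio_enc = nn.Sequential(           # encodes a mel-spectrogram chunk
            nn.Flatten(), nn.Linear(16 * 80, dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(             # decodes the joint embedding
            nn.Linear(2 * dim, 64 * 6 * 6), nn.ReLU(),
            nn.Unflatten(1, (64, 6, 6)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, face, mel):
        # Concatenate the two embeddings into one joint code, then decode.
        z = torch.cat([self.face_enc(face), self.audio_enc(mel)], dim=1)
        return self.decoder(z)                    # (B, 3, 24, 24) frame patch

gen = JointEmbedGenerator()
frame = gen(torch.rand(2, 3, 96, 96), torch.rand(2, 16, 80))
print(frame.shape)  # torch.Size([2, 3, 24, 24])
```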
The proposed framework, MARLIN, is a facial-video masked autoencoder that learns highly robust, generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos; these embeddings transfer across a variety of facial analysis tasks, such as Facial Attribute Recognition (FAR), Facial Expression Recognition, DeepFake Detection, and Lip Synchronization.
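The pretraining objective can be sketched generically. The snippet below is a minimal masked-autoencoder step on random "patch" tensors: most patches are masked, only the visible ones are encoded, and the reconstruction loss is taken on the masked positions. The masking ratio, layer sizes, and the omission of positional embeddings are simplifying assumptions, not MARLIN's actual design.

```python
# Generic masked-autoencoder step: encode visible patches, reconstruct the rest.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, patch_dim, dim, mask_ratio = 2, 196, 768, 256, 0.9

embed = nn.Linear(patch_dim, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), 2)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), 1)
to_pixels = nn.Linear(dim, patch_dim)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

patches = torch.rand(B, N, patch_dim)        # flattened facial-video patches
perm = torch.randperm(N)                     # random patch order
num_keep = int(N * (1 - mask_ratio))
keep, masked = perm[:num_keep], perm[num_keep:]

latent = encoder(embed(patches[:, keep]))    # encode visible patches only
full = mask_token.expand(B, N, dim).clone()  # start from all mask tokens
full[:, keep] = latent                       # scatter visible latents back
recon = to_pixels(decoder(full))             # reconstruct every patch
loss = F.mse_loss(recon[:, masked], patches[:, masked])  # loss on masked ones
print(loss.item())
```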
This work investigates the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment; it identifies key reasons why existing models fail to produce accurate lip-sync on such unconstrained videos and resolves them by learning from a powerful lip-sync discriminator.
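The core training signal can be sketched generically: a frozen sync "expert" scores how well generated mouth frames match the driving audio, and the generator is penalised via a BCE loss on that score. The toy networks below are stand-ins under assumed shapes, not the paper's actual architectures.

```python
# Illustrative sync loss from a frozen lip-sync "expert" (toy networks).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncExpert(nn.Module):
    """Toy SyncNet-style scorer: cosine similarity of video/audio embeddings."""
    def __init__(self, dim=128):
        super().__init__()
        self.video = nn.Linear(5 * 3 * 48 * 96, dim)  # 5 mouth-region frames
        self.audio = nn.Linear(16 * 80, dim)          # matching mel chunk
    def forward(self, frames, mel):
        v = F.normalize(self.video(frames.flatten(1)), dim=1)
        a = F.normalize(self.audio(mel.flatten(1)), dim=1)
        # Cosine similarity, clamped into (0, 1) so BCE treats it as p(in-sync).
        return (v * a).sum(dim=1).clamp(1e-6, 1 - 1e-6)

expert = SyncExpert().eval()
for p in expert.parameters():
    p.requires_grad_(False)                 # the expert stays frozen

gen_frames = torch.rand(4, 5, 3, 48, 96, requires_grad=True)  # generator output
mel = torch.rand(4, 16, 80)                                   # driving audio

p_sync = expert(gen_frames, mel)
sync_loss = F.binary_cross_entropy(p_sync, torch.ones_like(p_sync))
sync_loss.backward()                        # gradient flows back to the generator
print(float(sync_loss))
```

Freezing the expert is what makes it a useful teacher: its judgment of sync quality cannot be gamed by the generator during training.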