Given a video of a person speaking in a source language, generate a video of the same person speaking in a target language.
(Image credit: Papersgraph)
The proposed method performs pixel alignment rather than eye alignment by mapping the geometry of each face onto a reference face while keeping its own texture, and shows a clear improvement over eye-aligned recognition.
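The geometry-mapping step above can be sketched as a least-squares transform that moves a face's landmarks onto a reference face's landmarks while the texture is carried along unchanged. This is a minimal illustration with toy landmark coordinates, not the paper's actual alignment procedure; all function names here are hypothetical.

```python
import numpy as np

def fit_affine(src_pts, ref_pts):
    """Least-squares 2-D affine transform mapping source landmark
    coordinates onto reference-face landmark coordinates."""
    n = src_pts.shape[0]
    # Design matrix [x, y, 1]; solve for the 3x2 affine parameters.
    A = np.hstack([src_pts, np.ones((n, 1))])
    M, *_ = np.linalg.lstsq(A, ref_pts, rcond=None)
    return M.T  # 2x3 affine matrix

def warp_points(pts, M):
    """Apply the affine transform to landmark points."""
    A = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return A @ M.T

# Toy landmarks: the source face is a scaled, shifted copy of the reference.
ref = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
src = ref * 2.0 + np.array([5.0, 3.0])

M = fit_affine(src, ref)
aligned = warp_points(src, M)
print(np.allclose(aligned, ref))  # True: geometry now matches the reference
```

In a real pipeline the same transform would be used to warp the face image itself, so every face shares the reference geometry ("pixel alignment") while retaining its own texture.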
This work builds a working speech-to-speech translation system by bringing together multiple existing speech and language modules and adds a novel visual module, LipGAN, which generates realistic talking faces from the translated audio.
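The cascaded design described above can be sketched as a composition of modules: speech recognition, machine translation, speech synthesis, and finally a LipGAN-style lip-sync stage driven by the translated audio. The stubs below are stand-ins so the sketch runs; none of them are the paper's actual APIs.

```python
# Hypothetical stand-in modules; a real system would plug in trained
# ASR, MT, TTS, and lip-sync models at each stage.

def recognize(audio):
    # ASR stand-in: decode source-language speech to text.
    return "hello"

def translate(text, target_lang):
    # MT stand-in: map text into the target language.
    return f"[{target_lang}] {text}"

def synthesize(text):
    # TTS stand-in: render translated text as speech.
    return f"audio({text})"

def lip_sync(face_video, audio):
    # LipGAN-style stand-in: generate a talking face from the new audio.
    return f"video({audio})"

def face_to_face_translate(video, audio, target_lang):
    """Cascade the four modules into one face-to-face translation call."""
    text = recognize(audio)
    translated = translate(text, target_lang)
    new_audio = synthesize(translated)
    return lip_sync(video, new_audio)

out = face_to_face_translate("speaker.mp4", "speaker.wav", "hi")
print(out)  # video(audio([hi] hello))
```

The point of the sketch is the interface: each stage consumes the previous stage's output, so any individual module can be swapped for a stronger one without changing the rest of the pipeline.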
This work introduces a data-driven approach for unsupervised video retargeting that translates content from one domain to another while preserving the style native to the target domain, i.e., if the content of John Oliver's speech were transferred to Stephen Colbert, the generated content/speech should be in Stephen Colbert's style.
This work incorporates a triple consistency loss into the training of a new landmark-guided face-to-face synthesis network where, unlike in previous works, the generated images can simultaneously undergo large changes in both expression and pose.
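A triple consistency loss of this kind penalizes disagreement between generating the target pose directly and generating it via an intermediate pose: G(x, l_target) should match G(G(x, l_mid), l_target). The sketch below uses a toy linear "generator" purely so the computation runs; the real generator is a trained network, and the specific formulation here is an assumption for illustration.

```python
import numpy as np

def generator(image, landmarks):
    # Toy stand-in for a trained landmark-conditioned generator:
    # a real model would synthesize the face at the given landmarks.
    return 0.5 * image + 0.1 * landmarks

def triple_consistency_loss(image, lm_mid, lm_tgt):
    """L1 gap between generating the target pose directly and
    generating it through an intermediate landmark configuration."""
    direct = generator(image, lm_tgt)
    via_mid = generator(generator(image, lm_mid), lm_tgt)
    return float(np.mean(np.abs(direct - via_mid)))

x = np.ones((4, 4))                      # toy "image"
lm_mid = np.zeros((4, 4))                # intermediate pose landmarks
lm_tgt = np.full((4, 4), 0.5)            # target pose landmarks
loss = triple_consistency_loss(x, lm_mid, lm_tgt)
print(loss)  # 0.25 for this toy generator
```

Minimizing this term pushes the generator to treat pose changes as composable, which is what lets a single pass handle large joint changes in expression and pose.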