These leaderboards are used to track progress in Audio-Visual Synchronization.
No benchmarks available.
Use these libraries to find Audio-Visual Synchronization models and implementations.
No subtasks available.
To handle the longer temporal sequences required by sparse synchronisation signals, this work designs a multi-modal transformer that employs 'selectors' to distil the long audio and visual streams into short sequences, which are then used to predict the temporal offset between the streams.
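A minimal sketch of the selector idea described above: a few learnable queries cross-attend to a long feature stream and compress it before offset classification. All module and parameter names (Selector, OffsetSync, num_offset_classes) and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Selector(nn.Module):
    """Cross-attention from a few learnable queries to a long feature stream."""
    def __init__(self, dim, num_queries=16, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, stream):                       # stream: (B, T_long, dim)
        q = self.queries.unsqueeze(0).expand(stream.size(0), -1, -1)
        distilled, _ = self.attn(q, stream, stream)  # (B, num_queries, dim)
        return distilled

class OffsetSync(nn.Module):
    """Fuse distilled audio/visual tokens and classify the temporal offset."""
    def __init__(self, dim=256, num_offset_classes=21):
        super().__init__()
        self.audio_selector = Selector(dim)
        self.visual_selector = Selector(dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(dim, num_offset_classes)

    def forward(self, audio_feats, visual_feats):    # (B, Ta, dim), (B, Tv, dim)
        tokens = torch.cat([self.audio_selector(audio_feats),
                            self.visual_selector(visual_feats)], dim=1)
        fused = self.fusion(tokens)
        return self.head(fused.mean(dim=1))          # offset-class logits

# Example: long audio (800 frames) and visual (250 frames) streams
logits = OffsetSync()(torch.randn(2, 800, 256), torch.randn(2, 250, 256))
```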
An MTD-VocaLiST model is proposed, trained with a multimodal Transformer distillation (MTD) loss that enables it to closely mimic the cross-attention distributions and value relations in the Transformer of VocaLiST.
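A hedged sketch of an MTD-style loss as described above: the student matches the teacher's cross-attention distributions and a softmax-normalised value-relation term. The tensor shapes, the epsilon, and the weighting factor are assumptions, not the VocaLiST / MTD-VocaLiST code.

```python
import torch
import torch.nn.functional as F

def mtd_loss(student_attn, teacher_attn, student_v, teacher_v, alpha=1.0):
    """
    student_attn, teacher_attn: (B, heads, Tq, Tk) cross-attention probabilities
    student_v, teacher_v:       (B, heads, Tk, d)  value projections
    """
    # 1) cross-attention distribution matching (KL over the key dimension)
    attn_kl = F.kl_div(torch.log(student_attn + 1e-8), teacher_attn,
                       reduction="batchmean")

    # 2) value-relation matching: compare softmax-normalised V·V^T relations
    def value_relation(v):
        rel = torch.matmul(v, v.transpose(-1, -2)) / v.size(-1) ** 0.5
        return F.softmax(rel, dim=-1)

    vr_kl = F.kl_div(torch.log(value_relation(student_v) + 1e-8),
                     value_relation(teacher_v), reduction="batchmean")
    return attn_kl + alpha * vr_kl
```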
This work proposes a novel audio-visual synchronization model and a training scheme that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training, achieving state-of-the-art performance in both dense and sparse settings.
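A minimal sketch of segment-level audio-visual contrastive pre-training along the lines described above: temporally aligned audio/visual segment embeddings are positives, all other segments in the batch are negatives (symmetric InfoNCE). The encoders, embedding size, and temperature are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def segment_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """
    audio_emb, visual_emb: (N, d) embeddings of N aligned segment pairs,
    already pooled per segment and projected to a shared space.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric InfoNCE: audio -> visual and visual -> audio
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = segment_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```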
This paper presents Solos, a new dataset of music performance videos that can be used to train machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task.
Controllable LPCNet (CLPCNet), an improved LPCNet vocoder capable of pitch-shifting and time-stretching of speech, is proposed, and it is shown that CLPCNet performs pitch-shifting of speech on unseen datasets with high accuracy relative to prior neural methods.
An audio-visual cross-modal transformer-based model is proposed that outperforms several baseline models on the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2.
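A hedged sketch of a cross-modal transformer block for the in-sync / out-of-sync decision: audio tokens attend to visual tokens and vice versa before a binary classifier. Layer sizes, pooling, and module names are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossModalSync(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(nn.LayerNorm(2 * dim),
                                        nn.Linear(2 * dim, 1))

    def forward(self, audio, video):              # (B, Ta, dim), (B, Tv, dim)
        a_ctx, _ = self.a2v(audio, video, video)  # audio attends to video
        v_ctx, _ = self.v2a(video, audio, audio)  # video attends to audio
        pooled = torch.cat([a_ctx.mean(1), v_ctx.mean(1)], dim=-1)
        return self.classifier(pooled)            # one sync logit per clip

logit = CrossModalSync()(torch.randn(4, 100, 512), torch.randn(4, 25, 512))
```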
The Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking, outperforms the popular TalkNet model on two datasets.
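A rough sketch of the idea described above: per-frame audio-visual synchronisation features are fused with a pre-enrolled target-speaker embedding before a frame-level "is the target speaker speaking?" decision. All module names, dimensions, and the fusion/temporal layers are assumptions, not the TS-TalkNet implementation.

```python
import torch
import torch.nn as nn

class TargetSpeakerASD(nn.Module):
    def __init__(self, av_dim=256, spk_dim=192):
        super().__init__()
        self.fuse = nn.Linear(av_dim + spk_dim, av_dim)
        self.temporal = nn.GRU(av_dim, av_dim, batch_first=True)
        self.head = nn.Linear(av_dim, 1)

    def forward(self, av_sync_feats, speaker_emb):
        # av_sync_feats: (B, T, av_dim) per-frame audio-visual sync features
        # speaker_emb:   (B, spk_dim)   pre-enrolled target-speaker embedding
        spk = speaker_emb.unsqueeze(1).expand(-1, av_sync_feats.size(1), -1)
        fused = torch.relu(self.fuse(torch.cat([av_sync_feats, spk], dim=-1)))
        out, _ = self.temporal(fused)
        return self.head(out).squeeze(-1)         # per-frame speaking logits

scores = TargetSpeakerASD()(torch.randn(2, 50, 256), torch.randn(2, 192))
```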
A novel automatic metric with a 5-point scale, PEAVS, is introduced to evaluate the quality of audio-visual synchronization; experiments confirm its efficacy in objectively modeling subjective perceptions of audio-visual synchronization for videos "in the wild".