Lipreading is the process of extracting speech by watching a speaker's lip movements in the absence of sound. Humans lipread all the time without noticing; it plays a meaningful part in communication, albeit not as dominant a role as audio, and is a particularly useful skill for people who are hard of hearing. Deep Lipreading is the task of extracting speech from a video of a silent talking face using deep neural networks. It is also known by a few other names: Visual Speech Recognition (VSR), Machine Lipreading, and Automatic Lipreading. The primary methodology involves two stages: i) extracting visual and temporal features from the sequence of image frames of a silent talking video, and ii) decoding that feature sequence into units of speech, e.g. characters, words, or phrases. Implementations either train the two stages separately or train the whole pipeline end-to-end in one go, as in the sketch below.
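As a concrete illustration of this two-stage pipeline, here is a minimal PyTorch-style sketch. It is not a reference implementation of any particular paper; the layer sizes, the grayscale input format, and the 40-character vocabulary are all illustrative assumptions.

```python
# Minimal sketch of the two-stage visual speech recognition pipeline described above.
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """Stage i: extract visual/temporal features from a (B, 1, T, H, W) clip."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, clips):                       # clips: (B, 1, T, H, W)
        x = self.conv3d(clips)                      # (B, 32, T, H', W')
        x = x.mean(dim=[3, 4]).transpose(1, 2)      # global spatial pooling -> (B, T, 32)
        return self.proj(x)                         # (B, T, feat_dim)

class SpeechDecoder(nn.Module):
    """Stage ii: map the feature sequence to per-frame character logits."""
    def __init__(self, feat_dim=256, vocab_size=40):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, vocab_size)

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h)                          # (B, T, vocab_size)

# End-to-end usage: both stages applied jointly to a dummy 75-frame grayscale clip.
frontend, decoder = VisualFrontend(), SpeechDecoder()
logits = decoder(frontend(torch.randn(2, 1, 75, 64, 128)))
print(logits.shape)                                 # torch.Size([2, 75, 40])
```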
These leaderboards are used to track progress in Lipreading
Use these libraries to find Lipreading models and implementations
This work presents LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
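For context, a minimal sketch of how the connectionist temporal classification (CTC) loss is applied to per-frame character logits like those produced by the pipeline sketch above. The vocabulary size, blank index, and dummy tensors are assumptions, not LipNet's actual configuration.

```python
import torch
import torch.nn as nn

B, T, vocab_size = 2, 75, 40                        # batch, video frames, characters (index 0 = CTC blank)
logits = torch.randn(B, T, vocab_size, requires_grad=True)   # stand-in for the model's per-frame output
log_probs = logits.log_softmax(-1).transpose(0, 1)  # nn.CTCLoss expects (T, B, vocab)

targets = torch.randint(1, vocab_size, (B, 20))     # dummy unsegmented character transcripts
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                     # gradients flow back through the whole network end-to-end
```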
This work proposes an end-to-end deep learning architecture for word-level visual speech recognition that combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks.
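A rough sketch of a word-level recognition head in this spirit: a bidirectional LSTM over per-frame features from a 3D-convolutional/ResNet front-end, averaged over time and classified into a closed word vocabulary. The feature dimension and the 500-word output (as in LRW) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WordLevelHead(nn.Module):
    def __init__(self, feat_dim=512, num_words=500):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 256, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(512, num_words)

    def forward(self, feats):                  # feats: (B, T, feat_dim) from a 3D-conv/ResNet front-end
        h, _ = self.lstm(feats)                # (B, T, 512)
        return self.classifier(h.mean(dim=1))  # average over time -> one word label per clip

head = WordLevelHead()
print(head(torch.randn(4, 29, 512)).shape)     # torch.Size([4, 500]); LRW clips are 29 frames
```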
This work compares two lip-reading models built on top of the transformer self-attention architecture, one trained with a CTC loss and the other with a sequence-to-sequence loss.
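A hedged side-by-side sketch of the two losses, using a generic transformer encoder as a stand-in for the paper's architecture; all dimensions, the dummy transcripts, and the embedded decoder inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, T, U, d_model, vocab = 2, 75, 20, 256, 40
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
memory = enc(torch.randn(B, T, d_model))              # encoded visual features: (B, T, d_model)
transcripts = torch.randint(1, vocab, (B, U))         # dummy token transcripts

# Variant 1: CTC -- frame-wise logits aligned to the transcript by the CTC loss.
ctc_logits = nn.Linear(d_model, vocab)(memory)
ctc_loss = nn.CTCLoss(blank=0)(
    ctc_logits.log_softmax(-1).transpose(0, 1), transcripts,
    torch.full((B,), T, dtype=torch.long), torch.full((B,), U, dtype=torch.long))

# Variant 2: sequence-to-sequence -- a transformer decoder attends to the encoder output
# and is trained with cross-entropy against the transcript (causal masking omitted here).
dec = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
tgt_embed = torch.randn(B, U, d_model)                # embedded previous tokens (assumed)
s2s_logits = nn.Linear(d_model, vocab)(dec(tgt_embed, memory))
s2s_loss = nn.CrossEntropyLoss()(s2s_logits.reshape(-1, vocab), transcripts.reshape(-1))

print(ctc_loss.item(), s2s_loss.item())
```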
This is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW).
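A minimal two-stream sketch in the spirit of this description: one branch encodes raw video pixels, the other raw audio waveforms, and the temporally aligned features are concatenated before word classification. All shapes, strides, and the 500-word output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    def __init__(self, num_words=500):
        super().__init__()
        self.video = nn.Sequential(                      # (B, 1, T, H, W) -> (B, 64, T)
            nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(), nn.AdaptiveAvgPool3d((None, 1, 1)), nn.Flatten(2))
        self.audio = nn.Sequential(                      # (B, 1, samples) -> (B, 64, T')
            nn.Conv1d(1, 64, kernel_size=400, stride=640), nn.ReLU())
        self.classifier = nn.Linear(128, num_words)

    def forward(self, video, audio):
        v = self.video(video).transpose(1, 2)            # (B, T, 64)
        a = self.audio(audio).transpose(1, 2)            # (B, T', 64)
        a = nn.functional.interpolate(a.transpose(1, 2), size=v.size(1)).transpose(1, 2)
        fused = torch.cat([v, a], dim=-1)                # align time steps, then fuse
        return self.classifier(fused.mean(dim=1))        # clip-level word logits

model = AVFusion()
out = model(torch.randn(2, 1, 29, 64, 64), torch.randn(2, 1, 18560))  # 29 frames / 1.16 s at 16 kHz
print(out.shape)                                         # torch.Size([2, 500])
```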
This paper presents a naturally-distributed large-scale benchmark for lip-reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers, and is currently the largest word-level lipreading dataset and also the only public large-scale Mandarin lipreading dataset.
This work presents a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented Transformer (Conformer), that can be trained in an end-to-end manner and raises the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
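The hybrid objective can be sketched as a simple weighted sum of a CTC loss and an attention (cross-entropy) loss, as computed in the CTC vs. sequence-to-sequence sketch above; the 0.1 weight below is an assumption, not the paper's setting.

```python
import torch

ctc_loss = torch.tensor(2.3)                 # stand-ins for the two losses computed as in the
att_loss = torch.tensor(1.7)                 # CTC vs. sequence-to-sequence sketch above
ctc_weight = 0.1                             # assumed relative weighting
hybrid_loss = ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss
print(hybrid_loss)                           # tensor(1.7600)
```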
It is shown that the current state-of-the-art methodology produces models that do not generalize well to variations in sequence length, and this work addresses the issue by proposing a variable-length augmentation.
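A hedged sketch of variable-length augmentation consistent with that idea: each training clip is cut to a random temporal length so the model sees many sequence lengths. The minimum length and the use of contiguous cropping are assumptions.

```python
import torch

def variable_length_augment(frames: torch.Tensor, min_frames: int = 8) -> torch.Tensor:
    """frames: (T, H, W) video clip; returns a random contiguous sub-clip."""
    T = frames.size(0)
    new_len = torch.randint(min_frames, T + 1, (1,)).item()   # random target length
    start = torch.randint(0, T - new_len + 1, (1,)).item()    # random temporal offset
    return frames[start:start + new_len]

clip = torch.randn(29, 88, 88)
print(variable_length_augment(clip).shape)   # e.g. torch.Size([17, 88, 88])
```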
Audio-Visual Hidden Unit BERT (AV-HuBERT) is a self-supervised representation learning framework for audio-visual speech that masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units.
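A rough sketch of the masked-prediction idea behind this summary: spans of the fused audio-visual feature sequence are hidden (zeroed here for simplicity), and the model predicts discrete "hidden unit" targets at the masked positions. The cluster count, mask length, and transformer configuration are assumptions for illustration.

```python
import torch
import torch.nn as nn

B, T, d, num_units = 2, 100, 256, 100
av_feats = torch.randn(B, T, d)                       # fused audio + video frame features
units = torch.randint(0, num_units, (B, T))           # iteratively refined cluster assignments

# Mask a random span of each sequence.
mask = torch.zeros(B, T, dtype=torch.bool)
for b in range(B):
    start = torch.randint(0, T - 10, (1,)).item()
    mask[b, start:start + 10] = True
masked = av_feats.masked_fill(mask.unsqueeze(-1), 0.0)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)
logits = nn.Linear(d, num_units)(encoder(masked))     # (B, T, num_units)

# The prediction loss is computed only on the masked positions.
loss = nn.CrossEntropyLoss()(logits[mask], units[mask])
print(loss.item())
```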
This work proposes a two-stage speech recognition model that consistently achieves state-of-the-art performance by a significant margin, demonstrating the necessity and effectiveness of AE-MSR.
This work proposes adding prediction-based auxiliary tasks to a VSR model, highlights the importance of hyperparameter optimization and appropriate data augmentation, and shows that such a model works across different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
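A brief sketch of one way a prediction-based auxiliary task can be attached to a VSR objective: the main recognition loss is combined with an auxiliary regression loss that predicts a target representation from the same visual features. The L1 criterion, the stand-in target features, and the 0.5 weight are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

B, T, d_visual, d_target, vocab = 2, 75, 256, 80, 40
visual_feats = torch.randn(B, T, d_visual, requires_grad=True)   # stand-in encoder output
aux_targets = torch.randn(B, T, d_target)                        # assumed per-frame prediction targets

main_head = nn.Linear(d_visual, vocab)                           # recognition head
aux_head = nn.Linear(d_visual, d_target)                         # auxiliary prediction head

main_loss = nn.CTCLoss(blank=0)(
    main_head(visual_feats).log_softmax(-1).transpose(0, 1),
    torch.randint(1, vocab, (B, 20)),
    torch.full((B,), T, dtype=torch.long),
    torch.full((B,), 20, dtype=torch.long))
aux_loss = nn.L1Loss()(aux_head(visual_feats), aux_targets)

total_loss = main_loss + 0.5 * aux_loss                          # auxiliary task folded into the objective
total_loss.backward()
```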