3260 papers • 126 benchmarks • 313 datasets
Visual Speech Recognition (lipreading) is the task of recognising spoken content from visual information alone, typically video of a speaker's mouth region, without using the audio signal.
These leaderboards are used to track progress in Visual Speech Recognition
Use these libraries to find Visual Speech Recognition models and implementations
An end-to-end deep learning architecture for word-level visual speech recognition is proposed, combining spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks.
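A minimal sketch of this kind of architecture, assuming PyTorch, a ResNet-18 trunk, grayscale mouth crops, and illustrative dimensions (none of which are taken from the paper itself):

```python
import torch
import torch.nn as nn
import torchvision.models as models


class WordLevelVSR(nn.Module):
    """Spatiotemporal conv front-end -> per-frame ResNet trunk -> BiLSTM -> word logits."""

    def __init__(self, num_words=500):
        super().__init__()
        # 3D convolution over (time, height, width) captures short-range lip motion.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Residual trunk applied frame by frame: ResNet-18 minus its stem and fc layer.
        resnet = models.resnet18(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[4:-1])
        # Bidirectional LSTM aggregates the per-frame embeddings over time.
        self.blstm = nn.LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, num_words)

    def forward(self, x):                         # x: (batch, 1, frames, H, W)
        x = self.frontend(x)                      # (batch, 64, frames, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).flatten(1).view(b, t, -1)   # (batch, frames, 512)
        x, _ = self.blstm(x)                      # (batch, frames, 512)
        return self.classifier(x.mean(dim=1))     # average over time, then word logits


logits = WordLevelVSR()(torch.randn(2, 1, 29, 112, 112))   # e.g. 29-frame LRW-style clips
print(logits.shape)                                        # torch.Size([2, 500])
```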
This paper presents a naturally-distributed large-scale benchmark for lip-reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers; it is currently the largest word-level lipreading dataset and the only public large-scale Mandarin lip-reading dataset.
This work compares two models for lip reading, one using a CTC loss and the other a sequence-to-sequence loss, both built on top of the transformer self-attention architecture.
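As an illustration of the two objectives being compared, here is a minimal PyTorch sketch; the toy dimensions, dummy tensors, and character vocabulary are assumptions, not the paper's models:

```python
import torch
import torch.nn as nn

vocab, d_model, T, U, B = 40, 256, 75, 20, 2       # chars, width, frames, target length, batch
encoder_out = torch.randn(T, B, d_model)           # visual encoder features (time-major)
targets = torch.randint(1, vocab, (B, U))          # character targets; index 0 = blank/pad

# Option 1: CTC loss computed directly on the encoder output.
log_probs = nn.Linear(d_model, vocab)(encoder_out).log_softmax(-1)      # (T, B, vocab)
ctc_loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), U, dtype=torch.long),
)

# Option 2: sequence-to-sequence loss via an attention decoder (teacher forcing;
# the usual start/end-token shift is omitted for brevity).
embed = nn.Embedding(vocab, d_model)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=4), num_layers=2)
causal_mask = torch.triu(torch.full((U, U), float("-inf")), diagonal=1)
dec_out = decoder(embed(targets).transpose(0, 1), encoder_out, tgt_mask=causal_mask)
s2s_loss = nn.CrossEntropyLoss()(
    nn.Linear(d_model, vocab)(dec_out).reshape(-1, vocab),
    targets.transpose(0, 1).reshape(-1),
)
print(float(ctc_loss), float(s2s_loss))
```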
This work proposes adding prediction-based auxiliary tasks to a VSR model, highlights the importance of hyperparameter optimization and appropriate data augmentations, and shows that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
The proposed architecture surpasses the state of the art on closed-set word identification, attaining an 11.92% error rate on a vocabulary of 500 words, and demonstrates that word-level visual speech recognition is feasible even when the target words are not included in the training set.
This paper proposes the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation, and demonstrates that this system obtains very promising visual-only keyword spotting (KWS) results on the challenging LRS2 database, for keywords unseen during training.
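A minimal sketch of a grapheme-to-phoneme encoder-decoder of the kind described, using a small GRU seq2seq with placeholder vocabulary sizes; the paper's exact architecture and the downstream keyword-matching step are not reproduced here:

```python
import torch
import torch.nn as nn


class G2P(nn.Module):
    """Encode a word's characters, then decode its phoneme sequence."""

    def __init__(self, n_graphemes=30, n_phonemes=45, hidden=128):
        super().__init__()
        self.g_embed = nn.Embedding(n_graphemes, hidden)
        self.p_embed = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, graphemes, phonemes_in):
        # graphemes: (batch, word_len) character ids; phonemes_in: (batch, phon_len) shifted targets.
        _, state = self.encoder(self.g_embed(graphemes))         # summarise the spelling
        dec, _ = self.decoder(self.p_embed(phonemes_in), state)  # decode conditioned on it
        return self.out(dec)                                     # (batch, phon_len, n_phonemes)


model = G2P()
logits = model(torch.randint(0, 30, (4, 8)), torch.randint(0, 45, (4, 6)))
print(logits.shape)  # torch.Size([4, 6, 45])
```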
This work presents a novel approach to zero-shot learning by generating new classes using Generative Adversarial Networks (GANs), and shows how the addition of unseen class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases.
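A minimal sketch of the general idea, assuming a conditional GAN that synthesises visual features for a class from its embedding; the architecture sizes, optimisers, and class-embedding source are placeholders rather than the paper's settings:

```python
import torch
import torch.nn as nn

feat_dim, embed_dim, noise_dim = 512, 300, 100

G = nn.Sequential(nn.Linear(noise_dim + embed_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim + embed_dim, 512), nn.ReLU(), nn.Linear(512, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()


def gan_step(real_feats, class_embed):
    """One adversarial update on real visual features of seen classes."""
    b = real_feats.size(0)
    fake = G(torch.cat([torch.randn(b, noise_dim), class_embed], dim=1))

    # Discriminator: real vs. generated features, both conditioned on the class embedding.
    d_loss = bce(D(torch.cat([real_feats, class_embed], 1)), torch.ones(b, 1)) + \
             bce(D(torch.cat([fake.detach(), class_embed], 1)), torch.zeros(b, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to fool the discriminator.
    g_loss = bce(D(torch.cat([fake, class_embed], 1)), torch.ones(b, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()


gan_step(torch.randn(8, feat_dim), torch.randn(8, embed_dim))

# After training, sample features for an *unseen* class from its embedding alone
# and mix them into the recogniser's training data.
unseen_embed = torch.randn(1, embed_dim).expand(32, embed_dim)
synthetic_unseen = G(torch.cat([torch.randn(32, noise_dim), unseen_embed], dim=1))
print(synthetic_unseen.shape)  # torch.Size([32, 512])
```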
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture and significantly improves the state-of-the-art on the LRS3-TED set.
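A minimal sketch of the RNN-T structure (an encoder over fused audio-visual features, a label prediction network, and a joiner), with torchaudio's transducer loss applied to the resulting logit lattice; the dimensions and fusion scheme are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torchaudio

B, T, U, V, H = 2, 50, 10, 30, 256                 # batch, frames, target length, vocab (0 = blank), width

encoder = nn.LSTM(80 + 512, H, batch_first=True)   # consumes fused audio (80-d) + visual (512-d) features
embed = nn.Embedding(V, H)                         # label prediction network
pred_rnn = nn.LSTM(H, H, batch_first=True)
joiner = nn.Linear(2 * H, V)

feats = torch.randn(B, T, 80 + 512)
labels = torch.randint(1, V, (B, U), dtype=torch.int32)

enc, _ = encoder(feats)                            # (B, T, H)
pred, _ = pred_rnn(embed(labels.long()))           # (B, U, H)
pred = torch.cat([torch.zeros(B, 1, H), pred], 1)  # prepend the "no label emitted yet" state
# The joiner combines every encoder frame with every label position into a (B, T, U+1, V) lattice.
lattice = joiner(torch.cat([
    enc.unsqueeze(2).expand(B, T, U + 1, H),
    pred.unsqueeze(1).expand(B, T, U + 1, H),
], dim=-1))

loss = torchaudio.functional.rnnt_loss(
    lattice, labels,
    logit_lengths=torch.full((B,), T, dtype=torch.int32),
    target_lengths=torch.full((B,), U, dtype=torch.int32),
    blank=0,
)
print(float(loss))
```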
A comprehensive study evaluates the effects of different facial regions, including the mouth, the whole face, the upper face, and even the cheeks, with state-of-the-art VSR models, finding that despite the complex variations of the data, incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.
The inner workings of AV Align are investigated, and a regularisation method is proposed that involves predicting lip-related Action Units from visual representations; this leads to better exploitation of the visual modality and encourages researchers to rethink the multimodal convergence problem when one modality is dominant.
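A minimal sketch of such a regulariser, assuming an auxiliary linear head with a BCE loss on per-frame Action Unit activations added to the main recognition loss; the AU set, head, and loss weight are placeholders, not those of the AV Align paper:

```python
import torch
import torch.nn as nn

d_model, n_lip_aus, lambda_au = 256, 5, 0.1        # e.g. a handful of lip-related AUs

au_head = nn.Linear(d_model, n_lip_aus)            # auxiliary head on the visual encoder output
au_criterion = nn.BCEWithLogitsLoss()


def total_loss(recognition_loss, visual_repr, au_targets):
    """visual_repr: (batch, frames, d_model); au_targets: (batch, frames, n_lip_aus) in {0, 1}."""
    au_loss = au_criterion(au_head(visual_repr), au_targets)
    return recognition_loss + lambda_au * au_loss


loss = total_loss(torch.tensor(2.3),
                  torch.randn(4, 75, d_model),
                  torch.randint(0, 2, (4, 75, n_lip_aus)).float())
print(float(loss))
```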