3260 papers • 126 benchmarks • 313 datasets
Visual Speech Recognition (lipreading) is the task of recognising spoken content from visual information alone, typically video of a speaker's mouth region, without using the audio signal.
These leaderboards are used to track progress in Visual Speech Recognition
Use these libraries to find Visual Speech Recognition models and implementations
An end-to-end deep learning architecture for word-level visual speech recognition is proposed, combining spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks.
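A minimal sketch of this kind of architecture, assuming PyTorch, a ResNet-18 trunk, grayscale mouth crops, and illustrative dimensions (none of which are taken from the paper itself):

```python
import torch
import torch.nn as nn
import torchvision.models as models


class WordLevelVSR(nn.Module):
    """Spatiotemporal conv front-end -> per-frame ResNet trunk -> BiLSTM -> word logits."""

    def __init__(self, num_words=500):
        super().__init__()
        # 3D convolution over (time, height, width) captures short-range lip motion.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Residual trunk applied frame by frame: ResNet-18 minus its stem and fc layer.
        resnet = models.resnet18(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[4:-1])
        # Bidirectional LSTM aggregates the per-frame embeddings over time.
        self.blstm = nn.LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, num_words)

    def forward(self, x):                         # x: (batch, 1, frames, H, W)
        x = self.frontend(x)                      # (batch, 64, frames, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).flatten(1).view(b, t, -1)   # (batch, frames, 512)
        x, _ = self.blstm(x)                      # (batch, frames, 512)
        return self.classifier(x.mean(dim=1))     # average over time, then word logits


logits = WordLevelVSR()(torch.randn(2, 1, 29, 112, 112))   # e.g. 29-frame LRW-style clips
print(logits.shape)                                        # torch.Size([2, 500])
```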
This paper presents a naturally-distributed large-scale benchmark for lip-reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers; it is currently the largest word-level lipreading dataset and the only public large-scale Mandarin lip-reading dataset.
This work compares two models for lip reading, one using a CTC loss and the other a sequence-to-sequence loss, both built on top of the transformer self-attention architecture.
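As an illustration of the two objectives being compared, here is a minimal PyTorch sketch; the toy dimensions, dummy tensors, and character vocabulary are assumptions, not the paper's models:

```python
import torch
import torch.nn as nn

vocab, d_model, T, U, B = 40, 256, 75, 20, 2       # chars, width, frames, target length, batch
encoder_out = torch.randn(T, B, d_model)           # visual encoder features (time-major)
targets = torch.randint(1, vocab, (B, U))          # character targets; index 0 = blank/pad

# Option 1: CTC loss computed directly on the encoder output.
log_probs = nn.Linear(d_model, vocab)(encoder_out).log_softmax(-1)      # (T, B, vocab)
ctc_loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), U, dtype=torch.long),
)

# Option 2: sequence-to-sequence loss via an attention decoder (teacher forcing;
# the usual start/end-token shift is omitted for brevity).
embed = nn.Embedding(vocab, d_model)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=4), num_layers=2)
causal_mask = torch.triu(torch.full((U, U), float("-inf")), diagonal=1)
dec_out = decoder(embed(targets).transpose(0, 1), encoder_out, tgt_mask=causal_mask)
s2s_loss = nn.CrossEntropyLoss()(
    nn.Linear(d_model, vocab)(dec_out).reshape(-1, vocab),
    targets.transpose(0, 1).reshape(-1),
)
print(float(ctc_loss), float(s2s_loss))
```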
This work proposes adding prediction-based auxiliary tasks to a VSR model, highlights the importance of hyperparameter optimization and appropriate data augmentations, and shows that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
The proposed architecture surpasses the state of the art on closed-set word identification, attaining an 11.92% error rate on a vocabulary of 500 words, and demonstrates that word-level visual speech recognition is feasible even when the target words are not included in the training set.
This paper proposes the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation, and demonstrates that this system obtains very promising visual-only keyword spotting (KWS) results on the challenging LRS2 database, for keywords unseen during training.
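A minimal sketch of a grapheme-to-phoneme encoder-decoder of the kind described, using a small GRU seq2seq with placeholder vocabulary sizes; the paper's exact architecture and the downstream keyword-matching step are not reproduced here:

```python
import torch
import torch.nn as nn


class G2P(nn.Module):
    """Encode a word's characters, then decode its phoneme sequence."""

    def __init__(self, n_graphemes=30, n_phonemes=45, hidden=128):
        super().__init__()
        self.g_embed = nn.Embedding(n_graphemes, hidden)
        self.p_embed = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, graphemes, phonemes_in):
        # graphemes: (batch, word_len) character ids; phonemes_in: (batch, phon_len) shifted targets.
        _, state = self.encoder(self.g_embed(graphemes))         # summarise the spelling
        dec, _ = self.decoder(self.p_embed(phonemes_in), state)  # decode conditioned on it
        return self.out(dec)                                     # (batch, phon_len, n_phonemes)


model = G2P()
logits = model(torch.randint(0, 30, (4, 8)), torch.randint(0, 45, (4, 6)))
print(logits.shape)  # torch.Size([4, 6, 45])
```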
This work presents a novel approach to zero-shot learning by generating new classes using Generative Adversarial Networks (GANs), and shows how the addition of unseen class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases.
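A minimal sketch of the general idea, assuming a conditional GAN that synthesises visual features for a class from its embedding; the architecture sizes, optimisers, and class-embedding source are placeholders rather than the paper's settings:

```python
import torch
import torch.nn as nn

feat_dim, embed_dim, noise_dim = 512, 300, 100

G = nn.Sequential(nn.Linear(noise_dim + embed_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim + embed_dim, 512), nn.ReLU(), nn.Linear(512, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()


def gan_step(real_feats, class_embed):
    """One adversarial update on real visual features of seen classes."""
    b = real_feats.size(0)
    fake = G(torch.cat([torch.randn(b, noise_dim), class_embed], dim=1))

    # Discriminator: real vs. generated features, both conditioned on the class embedding.
    d_loss = bce(D(torch.cat([real_feats, class_embed], 1)), torch.ones(b, 1)) + \
             bce(D(torch.cat([fake.detach(), class_embed], 1)), torch.zeros(b, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to fool the discriminator.
    g_loss = bce(D(torch.cat([fake, class_embed], 1)), torch.ones(b, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()


gan_step(torch.randn(8, feat_dim), torch.randn(8, embed_dim))

# After training, sample features for an *unseen* class from its embedding alone
# and mix them into the recogniser's training data.
unseen_embed = torch.randn(1, embed_dim).expand(32, embed_dim)
synthetic_unseen = G(torch.cat([torch.randn(32, noise_dim), unseen_embed], dim=1))
print(synthetic_unseen.shape)  # torch.Size([32, 512])
```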
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture and significantly improves the state-of-the-art on the LRS3-TED set.
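A minimal sketch of the RNN-T structure (an encoder over fused audio-visual features, a label prediction network, and a joiner), with torchaudio's transducer loss applied to the resulting logit lattice; the dimensions and fusion scheme are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torchaudio

B, T, U, V, H = 2, 50, 10, 30, 256                 # batch, frames, target length, vocab (0 = blank), width

encoder = nn.LSTM(80 + 512, H, batch_first=True)   # consumes fused audio (80-d) + visual (512-d) features
embed = nn.Embedding(V, H)                         # label prediction network
pred_rnn = nn.LSTM(H, H, batch_first=True)
joiner = nn.Linear(2 * H, V)

feats = torch.randn(B, T, 80 + 512)
labels = torch.randint(1, V, (B, U), dtype=torch.int32)

enc, _ = encoder(feats)                            # (B, T, H)
pred, _ = pred_rnn(embed(labels.long()))           # (B, U, H)
pred = torch.cat([torch.zeros(B, 1, H), pred], 1)  # prepend the "no label emitted yet" state
# The joiner combines every encoder frame with every label position into a (B, T, U+1, V) lattice.
lattice = joiner(torch.cat([
    enc.unsqueeze(2).expand(B, T, U + 1, H),
    pred.unsqueeze(1).expand(B, T, U + 1, H),
], dim=-1))

loss = torchaudio.functional.rnnt_loss(
    lattice, labels,
    logit_lengths=torch.full((B,), T, dtype=torch.int32),
    target_lengths=torch.full((B,), U, dtype=torch.int32),
    blank=0,
)
print(float(loss))
```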
A comprehensive study evaluates the effects of different facial regions, including the mouth, the whole face, the upper face, and even the cheeks, with state-of-the-art VSR models, finding that despite the complex variations of the data, incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.
The inner workings of AV Align are investigated, and a regularisation method is proposed that involves predicting lip-related Action Units from visual representations; this leads to better exploitation of the visual modality and encourages researchers to rethink the multimodal convergence problem when one modality is dominant.
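A minimal sketch of such a regulariser, assuming an auxiliary linear head with a BCE loss on per-frame Action Unit activations added to the main recognition loss; the AU set, head, and loss weight are placeholders, not those of the AV Align paper:

```python
import torch
import torch.nn as nn

d_model, n_lip_aus, lambda_au = 256, 5, 0.1        # e.g. a handful of lip-related AUs

au_head = nn.Linear(d_model, n_lip_aus)            # auxiliary head on the visual encoder output
au_criterion = nn.BCEWithLogitsLoss()


def total_loss(recognition_loss, visual_repr, au_targets):
    """visual_repr: (batch, frames, d_model); au_targets: (batch, frames, n_lip_aus) in {0, 1}."""
    au_loss = au_criterion(au_head(visual_repr), au_targets)
    return recognition_loss + lambda_au * au_loss


loss = total_loss(torch.tensor(2.3),
                  torch.randn(4, 75, d_model),
                  torch.randint(0, 2, (4, 75, n_lip_aus)).float())
print(float(loss))
```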