Lipreading is the process of extracting speech by watching a speaker's lip movements in the absence of sound. Humans lipread all the time without noticing; it plays a meaningful part in communication, albeit not as dominant a role as audio, and is a particularly useful skill for people who are hard of hearing. Deep Lipreading is the task of extracting speech from a video of a silent talking face using deep neural networks. It is also known by a few other names: Visual Speech Recognition (VSR), Machine Lipreading, and Automatic Lipreading. The primary methodology involves two stages: i) extracting visual and temporal features from the sequence of image frames of a silent talking video, and ii) decoding that feature sequence into units of speech, e.g. characters, words, or phrases. Implementations either train the two stages separately or train the whole pipeline end-to-end in one go, as in the sketch below.
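As a concrete illustration of this two-stage pipeline, here is a minimal PyTorch-style sketch. It is not a reference implementation of any particular paper; the layer sizes, the grayscale input format, and the 40-character vocabulary are all illustrative assumptions.

```python
# Minimal sketch of the two-stage visual speech recognition pipeline described above.
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """Stage i: extract visual/temporal features from a (B, 1, T, H, W) clip."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, clips):                       # clips: (B, 1, T, H, W)
        x = self.conv3d(clips)                      # (B, 32, T, H', W')
        x = x.mean(dim=[3, 4]).transpose(1, 2)      # global spatial pooling -> (B, T, 32)
        return self.proj(x)                         # (B, T, feat_dim)

class SpeechDecoder(nn.Module):
    """Stage ii: map the feature sequence to per-frame character logits."""
    def __init__(self, feat_dim=256, vocab_size=40):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, vocab_size)

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h)                          # (B, T, vocab_size)

# End-to-end usage: both stages applied jointly to a dummy 75-frame grayscale clip.
frontend, decoder = VisualFrontend(), SpeechDecoder()
logits = decoder(frontend(torch.randn(2, 1, 75, 64, 128)))
print(logits.shape)                                 # torch.Size([2, 75, 40])
```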
These leaderboards are used to track progress in Lipreading
Use these libraries to find Lipreading models and implementations
This work presents LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
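For context, a minimal sketch of how the connectionist temporal classification (CTC) loss is applied to per-frame character logits like those produced by the pipeline sketch above. The vocabulary size, blank index, and dummy tensors are assumptions, not LipNet's actual configuration.

```python
import torch
import torch.nn as nn

B, T, vocab_size = 2, 75, 40                        # batch, video frames, characters (index 0 = CTC blank)
logits = torch.randn(B, T, vocab_size, requires_grad=True)   # stand-in for the model's per-frame output
log_probs = logits.log_softmax(-1).transpose(0, 1)  # nn.CTCLoss expects (T, B, vocab)

targets = torch.randint(1, vocab_size, (B, 20))     # dummy unsegmented character transcripts
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                     # gradients flow back through the whole network end-to-end
```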
This work proposes an end-to-end deep learning architecture for word-level visual speech recognition that combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks.
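A rough sketch of a word-level recognition head in this spirit: a bidirectional LSTM over per-frame features from a 3D-convolutional/ResNet front-end, averaged over time and classified into a closed word vocabulary. The feature dimension and the 500-word output (as in LRW) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WordLevelHead(nn.Module):
    def __init__(self, feat_dim=512, num_words=500):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 256, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(512, num_words)

    def forward(self, feats):                  # feats: (B, T, feat_dim) from a 3D-conv/ResNet front-end
        h, _ = self.lstm(feats)                # (B, T, 512)
        return self.classifier(h.mean(dim=1))  # average over time -> one word label per clip

head = WordLevelHead()
print(head(torch.randn(4, 29, 512)).shape)     # torch.Size([4, 500]); LRW clips are 29 frames
```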
This work compares two lip-reading models built on top of the transformer self-attention architecture, one trained with a CTC loss and the other with a sequence-to-sequence loss.
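A hedged side-by-side sketch of the two losses, using a generic transformer encoder as a stand-in for the paper's architecture; all dimensions, the dummy transcripts, and the embedded decoder inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, T, U, d_model, vocab = 2, 75, 20, 256, 40
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
memory = enc(torch.randn(B, T, d_model))              # encoded visual features: (B, T, d_model)
transcripts = torch.randint(1, vocab, (B, U))         # dummy token transcripts

# Variant 1: CTC -- frame-wise logits aligned to the transcript by the CTC loss.
ctc_logits = nn.Linear(d_model, vocab)(memory)
ctc_loss = nn.CTCLoss(blank=0)(
    ctc_logits.log_softmax(-1).transpose(0, 1), transcripts,
    torch.full((B,), T, dtype=torch.long), torch.full((B,), U, dtype=torch.long))

# Variant 2: sequence-to-sequence -- a transformer decoder attends to the encoder output
# and is trained with cross-entropy against the transcript (causal masking omitted here).
dec = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
tgt_embed = torch.randn(B, U, d_model)                # embedded previous tokens (assumed)
s2s_logits = nn.Linear(d_model, vocab)(dec(tgt_embed, memory))
s2s_loss = nn.CrossEntropyLoss()(s2s_logits.reshape(-1, vocab), transcripts.reshape(-1))

print(ctc_loss.item(), s2s_loss.item())
```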
This is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW).
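A minimal two-stream sketch in the spirit of this description: one branch encodes raw video pixels, the other raw audio waveforms, and the temporally aligned features are concatenated before word classification. All shapes, strides, and the 500-word output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    def __init__(self, num_words=500):
        super().__init__()
        self.video = nn.Sequential(                      # (B, 1, T, H, W) -> (B, 64, T)
            nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(), nn.AdaptiveAvgPool3d((None, 1, 1)), nn.Flatten(2))
        self.audio = nn.Sequential(                      # (B, 1, samples) -> (B, 64, T')
            nn.Conv1d(1, 64, kernel_size=400, stride=640), nn.ReLU())
        self.classifier = nn.Linear(128, num_words)

    def forward(self, video, audio):
        v = self.video(video).transpose(1, 2)            # (B, T, 64)
        a = self.audio(audio).transpose(1, 2)            # (B, T', 64)
        a = nn.functional.interpolate(a.transpose(1, 2), size=v.size(1)).transpose(1, 2)
        fused = torch.cat([v, a], dim=-1)                # align time steps, then fuse
        return self.classifier(fused.mean(dim=1))        # clip-level word logits

model = AVFusion()
out = model(torch.randn(2, 1, 29, 64, 64), torch.randn(2, 1, 18560))  # 29 frames / 1.16 s at 16 kHz
print(out.shape)                                         # torch.Size([2, 500])
```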
This paper presents a naturally-distributed large-scale benchmark for lip-reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers, and is currently the largest word-level lipreading dataset and also the only public large-scale Mandarin lipreading dataset.
This work presents a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented Transformer (Conformer), that can be trained in an end-to-end manner and raises the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
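The hybrid objective can be sketched as a simple weighted sum of a CTC loss and an attention (cross-entropy) loss, as computed in the CTC vs. sequence-to-sequence sketch above; the 0.1 weight below is an assumption, not the paper's setting.

```python
import torch

ctc_loss = torch.tensor(2.3)                 # stand-ins for the two losses computed as in the
att_loss = torch.tensor(1.7)                 # CTC vs. sequence-to-sequence sketch above
ctc_weight = 0.1                             # assumed relative weighting
hybrid_loss = ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss
print(hybrid_loss)                           # tensor(1.7600)
```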
It is shown that the current state-of-the-art methodology produces models that do not generalize well to variations in sequence length, and this work addresses the issue by proposing a variable-length augmentation.
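A hedged sketch of variable-length augmentation consistent with that idea: each training clip is cut to a random temporal length so the model sees many sequence lengths. The minimum length and the use of contiguous cropping are assumptions.

```python
import torch

def variable_length_augment(frames: torch.Tensor, min_frames: int = 8) -> torch.Tensor:
    """frames: (T, H, W) video clip; returns a random contiguous sub-clip."""
    T = frames.size(0)
    new_len = torch.randint(min_frames, T + 1, (1,)).item()   # random target length
    start = torch.randint(0, T - new_len + 1, (1,)).item()    # random temporal offset
    return frames[start:start + new_len]

clip = torch.randn(29, 88, 88)
print(variable_length_augment(clip).shape)   # e.g. torch.Size([17, 88, 88])
```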
Audio-Visual Hidden Unit BERT (AV-HuBERT) is a self-supervised representation learning framework for audio-visual speech that masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units.
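A rough sketch of the masked-prediction idea behind this summary: spans of the fused audio-visual feature sequence are hidden (zeroed here for simplicity), and the model predicts discrete "hidden unit" targets at the masked positions. The cluster count, mask length, and transformer configuration are assumptions for illustration.

```python
import torch
import torch.nn as nn

B, T, d, num_units = 2, 100, 256, 100
av_feats = torch.randn(B, T, d)                       # fused audio + video frame features
units = torch.randint(0, num_units, (B, T))           # iteratively refined cluster assignments

# Mask a random span of each sequence.
mask = torch.zeros(B, T, dtype=torch.bool)
for b in range(B):
    start = torch.randint(0, T - 10, (1,)).item()
    mask[b, start:start + 10] = True
masked = av_feats.masked_fill(mask.unsqueeze(-1), 0.0)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)
logits = nn.Linear(d, num_units)(encoder(masked))     # (B, T, num_units)

# The prediction loss is computed only on the masked positions.
loss = nn.CrossEntropyLoss()(logits[mask], units[mask])
print(loss.item())
```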
This work proposes a two-stage speech recognition model that consistently achieves state-of-the-art performance by a significant margin, demonstrating the necessity and effectiveness of AE-MSR.
This work proposes adding prediction-based auxiliary tasks to a VSR model, highlights the importance of hyperparameter optimization and appropriate data augmentation, and shows that such a model works across different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
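A brief sketch of one way a prediction-based auxiliary task can be attached to a VSR objective: the main recognition loss is combined with an auxiliary regression loss that predicts a target representation from the same visual features. The L1 criterion, the stand-in target features, and the 0.5 weight are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

B, T, d_visual, d_target, vocab = 2, 75, 256, 80, 40
visual_feats = torch.randn(B, T, d_visual, requires_grad=True)   # stand-in encoder output
aux_targets = torch.randn(B, T, d_target)                        # assumed per-frame prediction targets

main_head = nn.Linear(d_visual, vocab)                           # recognition head
aux_head = nn.Linear(d_visual, d_target)                         # auxiliary prediction head

main_loss = nn.CTCLoss(blank=0)(
    main_head(visual_feats).log_softmax(-1).transpose(0, 1),
    torch.randint(1, vocab, (B, 20)),
    torch.full((B,), T, dtype=torch.long),
    torch.full((B,), 20, dtype=torch.long))
aux_loss = nn.L1Loss()(aux_head(visual_feats), aux_targets)

total_loss = main_loss + 0.5 * aux_loss                          # auxiliary task folded into the objective
total_loss.backward()
```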