Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction (2022-01-05)

TL;DR

Audio-Visual Hidden Unit BERT (AV-HuBERT) is a self-supervised representation learning framework for audio-visual speech that masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units.

Abstract

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
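The comparisons quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `relative_reduction` is hypothetical and not part of the released code, and the inputs are the figures stated above. It suggests that the "40%" relative reduction is a rounded value (1.3% vs 2.3% works out to roughly 43%) and that 31K hours against 30 hours is indeed about a thousandfold difference in labeled data.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (both are WER percentages)."""
    return (baseline - new) / baseline


# Figures quoted in the abstract (LRS3 benchmark).
asr_reduction = relative_reduction(2.3, 1.3)    # audio-only ASR: 2.3% -> 1.3% WER
lip_reduction = relative_reduction(33.6, 32.5)  # lip reading: 33.6% -> 32.5% WER
label_ratio = 31_000 / 30                       # labeled hours: prior work vs AV-HuBERT

print(f"ASR relative WER reduction: {asr_reduction:.1%}")          # 43.5%
print(f"Lip-reading relative WER reduction: {lip_reduction:.1%}")  # 3.3%
print(f"Labeled-data ratio: {label_ratio:.0f}x")                   # 1033x
```

The lip-reading gain over the prior system is small in relative terms; the notable part of that comparison is that it is reached with about a thousandth of the transcribed data.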

Authors

Wei-Ning Hsu

Bowen Shi

Abdel-rahman Mohamed
