VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text (2021-04-22)

TL;DR

The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures on downstream tasks and sets a new record on waveform-based audio event recognition, achieving an mAP of 39.4% on AudioSet without any supervised pre-training.

Abstract

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate it on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures on these downstream tasks. In particular, VATT's vision Transformer achieves top-1 accuracies of 82.1% on Kinetics-400, 83.6% on Kinetics-600, 72.7% on Kinetics-700, and 41.1% on Moments in Time, all new records obtained without supervised pre-training. Transferring to image classification yields 78.7% top-1 accuracy on ImageNet, compared to 64.7% when training the same Transformer from scratch, showing that the model generalizes despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition, achieving an mAP of 39.4% on AudioSet without any supervised pre-training. VATT's source code is publicly available.
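
To make the phrase "multimodal contrastive losses" concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss between two modalities. The abstract does not spell out the loss, so this is an illustration under stated assumptions and not necessarily the formulation VATT uses: the function name `info_nce`, the temperature value, and the use of NumPy are all hypothetical choices made for this example.

```python
import numpy as np

def info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    emb_a, emb_b: arrays of shape (batch, dim); row i of each is a positive
    pair (e.g. video and audio from the same clip), and every other row in
    the batch acts as a negative. Hypothetical helper, not VATT's code.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # shape (batch, batch)

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

In this sketch, correctly paired embeddings give a lower loss than mismatched ones, which is the signal that lets a model learn from unlabeled video, audio, and text that co-occur.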

Authors

Yin Cui

Shih-Fu Chang

Boqing Gong
