1. Meta-Transformer: A Unified Framework for Multimodal Learning
2. VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
3. TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
4. ImageBind: One Embedding Space to Bind Them All
5. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
6. VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
7. Unmasked Teacher: Towards Training-Efficient Video Foundation Models
8. AIM: Adapting Image Models for Efficient Video Action Recognition
9. Edge-Guided Multi-Domain RGB-to-TIR Image Translation for Training Vision Tasks with Challenging Labels
10. InternVideo: General Video Foundation Models via Generative and Discriminative Learning
11. Scaling Language-Image Pre-Training via Masking
12. Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
13. OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
14. Learning Audio-Video Modalities from Image Captions
15. Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth
16. PointCLIP: Point Cloud Understanding by CLIP
17. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
18. Masked Autoencoders Are Scalable Vision Learners
19. Ego4D: Around the World in 3,000 Hours of Egocentric Video
20. LLVIP: A Visible-Infrared Paired Dataset for Low-Light Vision
21. LoRA: Low-Rank Adaptation of Large Language Models
22. CLIP4Clip: An Empirical Study of CLIP for End-to-End Video Clip Retrieval
23. Learning Transferable Visual Models From Natural Language Supervision
24. A Straightforward Framework For Video Retrieval Using CLIP
25. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
26. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
27. Rethinking CNN Models for Audio Classification
28. ActBERT: Learning Global-Local Video-Text Representations
29. VGGSound: A Large-Scale Audio-Visual Dataset
30. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
31. MMAct: A Large-Scale Dataset for Cross-Modal Human Action Understanding
32. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
33. AudioCaps: Generating Captions for Audios in The Wild
34. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
35. Convolutional Neural Networks for Static and Dynamic Breast Infrared Imaging Classification
36. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
37. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
38. Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos
39. Audio-Visual Event Localization in Unconstrained Videos
40. Localizing Moments in Video with Natural Language
41. Attention Is All You Need
42. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
43. The Kinetics Human Action Video Dataset
44. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events
45. YouTube-8M: A Large-Scale Video Classification Benchmark
46. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
47. Estimating Depth From Monocular Images as Classification Using Deep Fully Convolutional Residual Networks
48. Content-Based Video Recommendation System Based on Stylistic Visual Features
49. Deep Residual Learning for Image Recognition
50. ESC: Dataset for Environmental Sound Classification
51. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding
52. Interactive Intrinsic Video Editing
53. Large-Scale Video Classification with Convolutional Neural Networks
54. Two-Stream Convolutional Networks for Action Recognition in Videos
55. Microsoft COCO: Common Objects in Context
56. Freesound Technical Demo
57. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
58. Indoor Segmentation and Support Inference from RGBD Images
59. HMDB: A Large Video Database for Human Motion Recognition
60. Collecting Highly Parallel Data for Paraphrase Evaluation
61. ImageNet: A Large-Scale Hierarchical Image Database
62. Recognizing Human Actions: A Local SVM Approach
63. Simplifying Video Editing Using Metadata
64. Image and Video Search Engine for the World Wide Web
65. Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
66. Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
67. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
70. Free Teledyne FLIR Thermal Dataset for Algorithm Training
For depth, we use the NYU-v2 dataset (Silberman et al., 2012) for validation, with its 654 test samples. Through preprocessing, we constrained the depth images to a maximum depth of 10 meters. Following ImageBind, we undertook a category reorganization process.
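A minimal sketch of this preprocessing step (the function name and the final normalization to [0, 1] are assumptions; only the 10-meter cap comes from the text):

```python
import numpy as np

MAX_DEPTH_M = 10.0  # ceiling stated in the text

def preprocess_depth(depth_m: np.ndarray) -> np.ndarray:
    """Clamp a metric depth map to at most MAX_DEPTH_M meters and rescale."""
    depth = np.nan_to_num(depth_m, nan=0.0)   # treat invalid/missing readings as 0
    depth = np.clip(depth, 0.0, MAX_DEPTH_M)  # constrain to a 10 m maximum depth
    return depth / MAX_DEPTH_M                # normalize to [0, 1] (assumption)
```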
MSR-VTT (Xu et al., 2016) comprises 10K YouTube video clips, paired with 200K captions in total.
We validate the zero-shot classification capability on the ESC-50 dataset (Piczak, 2015), which contains 2,000 test audio clips, each labeled with a single class. For zero-shot retrieval
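A minimal sketch of how bind-style zero-shot classification on ESC-50 typically works: embed each class name as text, embed the audio clip into the same shared space, and pick the most similar class. The encoders and the prompt template here are assumptions, not the paper's exact recipe.

```python
import numpy as np

def zero_shot_classify(audio_emb: np.ndarray, class_names: list[str], encode_text):
    """Return the index of the class most similar to the audio embedding.

    `audio_emb` and the vectors returned by `encode_text` (a hypothetical
    text encoder into the shared space) are assumed to be L2-normalized.
    """
    # Embed each label as a natural-language prompt (template is an assumption).
    text_embs = np.stack([encode_text(f"a sound of {name}") for name in class_names])
    scores = text_embs @ audio_emb  # cosine similarity via dot product
    return int(np.argmax(scores))   # predicted class index
```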
D LICENSE
Unless explicitly noted otherwise, our released datasets are provided to users under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License.