mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (2023-02-01)

Abstract

Recent years have witnessed a broad convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which benefits from modality collaboration while addressing the problem of modality entanglement. In contrast to the predominant paradigms, which rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 achieves new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
