Large-scale pre-trained foundation models have become an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from low computational efficiency and from the linguistic signal being overwhelmed by long visual sequences during cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability on vision-language and video-language tasks. The code and pre-trained models are available at https://github.com/alibaba/AliceMind.
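To make the skip-connection idea concrete, below is a minimal PyTorch sketch of a fusion block in which the visual sequence bypasses a stack of cheap text-side co-attention layers and is re-injected only in a periodic joint ("connected") attention layer. All module names, layer counts and hyper-parameters here are illustrative assumptions; they are not taken from the paper or the AliceMind repository.

```python
# Illustrative sketch only; not the authors' implementation.
import torch
import torch.nn as nn


class AsymmetricCoAttnLayer(nn.Module):
    """Text tokens attend to themselves and to the (unchanged) visual tokens;
    the visual sequence skips this layer entirely."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.norm2(text), vision, vision)[0]
        return text + self.ffn(self.norm3(text))


class ConnectedAttnLayer(nn.Module):
    """Re-injects the skipped visual tokens: joint self-attention over the
    concatenation of the original vision features and the fused text features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor):
        joint = torch.cat([vision, text], dim=1)  # the skipped visual tokens rejoin here
        j = self.norm1(joint)
        joint = joint + self.attn(j, j, j)[0]
        joint = joint + self.ffn(self.norm2(joint))
        n_v = vision.size(1)
        return joint[:, :n_v], joint[:, n_v:]  # updated vision, updated text


class SkipConnectedFusionBlock(nn.Module):
    """s cheap text-side layers (vision skipped) followed by one joint layer."""

    def __init__(self, dim: int, s: int = 2):
        super().__init__()
        self.asym = nn.ModuleList(AsymmetricCoAttnLayer(dim) for _ in range(s))
        self.connected = ConnectedAttnLayer(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor):
        for layer in self.asym:
            text = layer(text, vision)  # vision passes through untouched (the skip)
        return self.connected(text, vision)


# Example: fuse 4 text tokens with 9 image patches of width 512.
block = SkipConnectedFusionBlock(dim=512, s=2)
vision_out, text_out = block(torch.randn(1, 4, 512), torch.randn(1, 9, 512))
```

The intuition captured here is that the expensive joint attention over the long visual sequence runs only once every few layers, while the intermediate layers update the (short) text sequence against fixed visual keys, which is one plausible reading of how skip-connections can give both efficiency and a stronger linguistic signal.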
Authors: Ji Zhang, Ming Yan, Haiyang Xu, Jiabo Ye, Chenliang Li, Bin Bi, Songfang Huang, Feiran Huang, Junfeng Tian, Wei Wang, Hehong Chen, Zheng-da Cao