This work proposes BriVL, a two-tower pre-training model within the cross-modal contrastive learning framework. By building a large queue-based dictionary, BriVL incorporates more negative samples with limited GPU resources and outperforms both UNITER and OpenAI CLIP on various downstream tasks.
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method, MoCo, to the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
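The queue-based negative sampling described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function names, embedding dimensions, and temperature value are illustrative assumptions. It shows the core idea of a MoCo-style cross-modal InfoNCE loss, where each image embedding is contrasted against its paired text embedding (positive) and a FIFO queue of text embeddings from earlier batches (negatives), so the effective number of negatives is decoupled from the GPU batch size.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_info_nce(img_emb, txt_emb, queue, temperature=0.07):
    """Image-to-text InfoNCE loss with queued negatives (illustrative sketch).

    img_emb: (B, D) image-tower outputs
    txt_emb: (B, D) paired text-tower outputs (positives)
    queue:   (K, D) text embeddings from previous batches (negatives)
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    neg = l2_normalize(queue)
    # Positive similarity: one per pair, shape (B, 1).
    pos_logits = np.sum(img * txt, axis=1, keepdims=True)
    # Negative similarities against the whole queue, shape (B, K).
    neg_logits = img @ neg.T
    logits = np.concatenate([pos_logits, neg_logits], axis=1) / temperature
    # Numerically stable log-softmax; the positive sits at column 0.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()

def update_queue(queue, new_keys, ptr):
    """Overwrite the oldest entries of the FIFO queue with new text embeddings."""
    k, cap = new_keys.shape[0], queue.shape[0]
    idx = (ptr + np.arange(k)) % cap
    queue[idx] = new_keys
    return queue, (ptr + k) % cap
```

After each training step, the current batch's text embeddings are enqueued (and the oldest dequeued), which is how a large dictionary of negatives is maintained without holding all of them in a single batch.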
Jinming Zhao, Zhicheng Dou, Wayne Xin Zhao, Ji-rong Wen, Yanyan Lan, Xin Hong, Yuqing Song, Shizhe Chen, Qin Jin, Yuchong Sun, Ruihua Song, Anwen Hu, Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Yida Zhao, Junyi Li, Manli Zhang, Guangzhen Liu, Yizhao Gao, Jing Wen, Baogui Xu, Weihao Zheng, Zongzheng Xi, Liang Zhang, Wanqing Cui, Danyang Hou, Yingyan Li, Peiyu Liu, Zheng Gong, Chu Jin