1
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
2
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
3
Learning Transferable Visual Models From Natural Language Supervision
4
Learning the Best Pooling Strategy for Visual Semantic Embedding
5
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
6
Sharpness-Aware Minimization for Efficiently Improving Generalization
7
Contrastive Learning of Medical Visual Representations from Paired Images and Text
8
Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders
9
Learning Visual Representations with Caption Annotations
10
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
11
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
12
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
13
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
14
VirTex: Learning Visual Representations from Textual Annotations
15
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
16
Rethinking Pre-training and Self-training
17
Prototypical Contrastive Learning of Unsupervised Representations
18
Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO
19
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
20
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
22
A Metric Learning Reality Check
23
Understand in 5 Minutes!? Skim-Reading Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
24
A Simple Framework for Contrastive Learning of Visual Representations
25
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
26
Big Transfer (BiT): General Visual Representation Learning
27
Self-Supervised Learning of Pretext-Invariant Representations
28
Momentum Contrast for Unsupervised Visual Representation Learning
29
Self-Training With Noisy Student Improves ImageNet Classification
30
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
31
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
32
UNITER: UNiversal Image-TExt Representation Learning
33
Visual Semantic Reasoning for Image-Text Matching
34
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
35
RoBERTa: A Robustly Optimized BERT Pretraining Approach
36
Natural Adversarial Examples
37
XLNet: Generalized Autoregressive Pretraining for Language Understanding
38
Fixing the train-test resolution discrepancy
39
Contrastive Multiview Coding
40
Data-Efficient Image Recognition with Contrastive Predictive Coding
41
Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
42
Billion-scale semi-supervised learning for image classification
43
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
44
Graph-RISE: Graph-Regularized Image Semantic Embedding
45
Do ImageNet Classifiers Generalize to ImageNet?
46
Classification is a Strong Baseline for Deep Metric Learning
47
Findings of the Third Shared Task on Multimodal Machine Translation
48
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
49
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search
50
Exploring the Limits of Weakly Supervised Pretraining
51
Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description
52
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
53
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
54
Learning Visual N-Grams from Web Data
55
Dual Attention Networks for Multimodal Reasoning and Matching
56
Multi30K: Multilingual English-German Image Descriptions
57
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
58
Learning Visual Features from Large Weakly Supervised Data
59
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
60
Microsoft COCO Captions: Data Collection and Evaluation Server
61
Deep visual-semantic alignments for generating image descriptions
62
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
63
GloVe: Global Vectors for Word Representation
64
Going deeper with convolutions
65
Food-101 - Mining Discriminative Components with Random Forests
66
SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation
67
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
68
Grounded Compositional Semantics for Finding and Describing Images with Sentences
69
Learning Fine-Grained Image Similarity with Deep Ranking
70
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
71
DeViSE: A Deep Visual-Semantic Embedding Model
72
3D Object Representations for Fine-Grained Categorization
73
Distributed Representations of Words and Phrases and their Compositionality
74
Efficient Estimation of Word Representations in Vector Space
75
ImageNet: A large-scale hierarchical image database
76
Automated Flower Classification over a Large Number of Classes
78
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
79
Language Models are Unsupervised Multitask Learners
80
The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale
81
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision