ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Published in

Neural Information Processing Systems (2019)

TL;DR

ViLBERT (short for Vision-and-Language BERT) is a model for learning task-agnostic joint representations of image content and natural language. It extends the BERT architecture to a multi-modal two-stream model that processes visual and textual inputs in separate streams, which interact through co-attentional transformer layers.

Abstract

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks (visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval) by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models, achieving state-of-the-art results on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
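
The two-stream interaction described in the abstract can be illustrated with a minimal sketch. It assumes that "co-attentional" means each stream forms queries from its own features and attends over the other stream's keys and values. The function names (`attend`, `co_attention`) and the toy inputs are hypothetical. Learned projections, multiple heads, residual connections, and layer normalization are all omitted, so this is an illustration of the idea rather than the authors' implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention: every query row becomes a weighted
    # average of the value rows, weighted by query-key similarity.
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    mixed = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(width)])
    return mixed

def co_attention(visual, text):
    # Assumption: each stream queries the OTHER stream's keys and values.
    new_visual = attend(visual, text, text)    # image regions attend to words
    new_text = attend(text, visual, visual)    # words attend to image regions
    return new_visual, new_text

# Toy inputs (hypothetical): 3 image-region features and 2 token features, width 4.
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
new_visual, new_text = co_attention(visual, text)
```

Each stream keeps its own sequence length (three regions, two tokens), while its updated features are built entirely from the other modality. That is the sense in which the streams stay separate yet interact.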

Authors

Devi Parikh

Dhruv Batra

Stefan Lee

Field of Study

Computer Science

Venue Information

Name

Neural Information Processing Systems

Type

conference

URL

http://neurips.cc/

Alternate Names

Neural Inf Process Syst
NeurIPS
NIPS
