Unifying Vision-and-Language Tasks via Text Generation

Published in

International Conference on Machine Learning (2021)

TL;DR

This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
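The shared language modeling objective can be written out as a conditional negative log-likelihood. This is a sketch under assumed notation that does not appear in the text above: x is the input text, v the visual input, y the target label text of length |y|, and θ the single set of model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{j=1}^{|y|} \log P_{\theta}\left(y_j \mid y_{<j},\, x,\, v\right)
```

Under this reading, every task uses the same loss and differs only in what x and y contain.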

Abstract

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. Examples include a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach generalizes better on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
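The idea of generating labels as text can be illustrated with a minimal sketch of how different tasks might be cast into one (input text, target text) format. The task prefixes ("vqa:", "visual grounding:", "caption:"), the `<region_N>` token format, and all function and class names below are hypothetical choices made for illustration; they are not taken from the abstract, and the image features that would accompany each input are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextToTextExample:
    source: str  # text prompt given to the model together with the image features
    target: str  # label written out as text, which the model learns to generate


# Hypothetical task prefixes and region-token format, chosen for illustration only.
def cast_vqa(question: str, answer: str) -> TextToTextExample:
    # The answer is generated as free text instead of picked by a classifier.
    return TextToTextExample(source=f"vqa: {question}", target=answer)


def cast_referring_expression(expression: str, region_index: int) -> TextToTextExample:
    # The target names the referred region as a text token instead of a region score.
    return TextToTextExample(
        source=f"visual grounding: {expression}",
        target=f"<region_{region_index}>",
    )


def cast_captioning(caption: str) -> TextToTextExample:
    return TextToTextExample(source="caption:", target=caption)


def build_multitask_batch() -> List[TextToTextExample]:
    # Because every task shares one format, examples can be mixed in one batch.
    return [
        cast_vqa("what is the man holding?", "umbrella"),
        cast_referring_expression("the dog on the left", 3),
        cast_captioning("a man walking a dog in the rain"),
    ]


if __name__ == "__main__":
    for example in build_multitask_batch():
        print(f"{example.source!r} -> {example.target!r}")
```

Since all three tasks reduce to the same pair of strings, a single sequence-to-sequence model with one output vocabulary could in principle be trained on the mixed batch, which is the property the multi-task claim in the abstract relies on.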

Authors

Mohit Bansal

Jaemin Cho

Jie Lei

References (86 items)

Learning Transferable Visual Models From Natural Language Supervision

VinVL: Making Visual Representations Matter in Vision-Language Models

Making Pre-trained Language Models Better Few-shot Learners

Eliciting Knowledge from Language Models Using Automatically Generated Prompts

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Multimodal Transformer for Multimodal Machine Translation

ActBERT: Learning Global-Local Video-Text Representations

Language Models are Few-Shot Learners

UnifiedQA: Crossing Format Boundaries With a Single QA System

Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Document Ranking with a Pretrained Sequence-to-Sequence Model

XGPT: Cross-modal Generative Pre-Training for Image Captioning

Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

12-in-1: Multi-Task Vision and Language Representation Learning

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

HuggingFace's Transformers: State-of-the-art Natural Language Processing

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

UNITER: UNiversal Image-TExt Representation Learning

Unified Vision-Language Pre-Training for Image Captioning and VQA

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Attention on Attention for Image Captioning

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

RoBERTa: A Robustly Optimized BERT Pretraining Approach

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Contrastive Bidirectional Transformer for Temporal Representation Learning

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Unifying Question Answering and Text Classification via Span Extraction

VideoBERT: A Joint Model for Video and Language Representation Learning

Probing the Need for Visual Context in Multimodal Machine Translation

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

From Recognition to Cognition: Visual Commonsense Reasoning

A Corpus for Reasoning about Natural Language Grounded in Photographs

Findings of the Third Shared Task on Multimodal Machine Translation

TVQA: Localized, Compositional Video Question Answering

The MeMAD Submission to the WMT18 Multimodal Translation Task

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

The Natural Language Decathlon: Multitask Learning as Question Answering

Bilinear Attention Networks

A Call for Clarity in Reporting BLEU Scores

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Self-Attention with Relative Position Representations

MAttNet: Modular Attention Network for Referring Expression Comprehension

Decoupled Weight Decay Regularization

Automatic differentiation in PyTorch

Mixed Precision Training

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Attention is All you Need

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

Towards Automatic Learning of Procedures From Web Instructional Videos

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Using the Output Embedding to Improve Language Models

Modeling Context in Referring Expressions

SPICE: Semantic Propositional Image Caption Evaluation

SQuAD: 100,000+ Questions for Machine Comprehension of Text

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

Multi30K: Multilingual English-German Image Descriptions

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Visual7W: Grounded Question Answering in Images

Generation and Comprehension of Unambiguous Object Descriptions

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Microsoft COCO Captions: Data Collection and Evaluation Server

Deep visual-semantic alignments for generating image descriptions

CIDEr: Consensus-based image description evaluation

ReferItGame: Referring to Objects in Photographs of Natural Scenes

Microsoft COCO: Common Objects in Context

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Im2Text: Describing Images Using 1 Million Captioned Photographs

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Bleu: a Method for Automatic Evaluation of Machine Translation

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Pre-training text encoders as discriminators rather than generators

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

RefCOCOg: We use the umd split, which consists of train / val / test sets with 42,226 / 2,573 / 5,023 sentences, respectively. Following UNITER (Chen et al., 2020) and MAttNet

VCR: Train / val / test splits consist of 212… / … / 25,263 questions, respectively. We train our model on the train split and use the val split for validation