[2] CoCa: Contrastive Captioners are Image-Text Foundation Models
[3] Flamingo: a Visual Language Model for Few-Shot Learning
[4] Unified Contrastive Learning in Image-Text-Label Space
[5] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[6] SLIP: Self-supervision meets Language-Image Pre-training
[7] Injecting Semantic Concepts into End-to-End Image Captioning
[8] FLAVA: A Foundational Language And Vision Alignment Model
[9] Grounded Language-Image Pre-training
[10] Scaling Up Vision-Language Pretraining for Image Captioning
[11] UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
[12] Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
[13] LiT: Zero-Shot Transfer with Locked-image text Tuning
[14] FILIP: Fine-grained Interactive Language-Image Pre-Training
[15] An Empirical Study of Training End-to-End Vision-and-Language Transformers
[16] VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
[17] Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
[18] Pix2seq: A Language Modeling Framework for Object Detection
[19] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
[20] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
[21] How Much Can CLIP Benefit Vision-and-Language Tasks?
[22] Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
[23] Dynamic Head: Unifying Object Detection Heads with Attentions
[24] VinVL: Revisiting Visual Representations in Vision-Language Models
[25] MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
[26] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
[27] Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
[28] Learning Transferable Visual Models From Natural Language Supervision
[29] UniT: Multimodal Multitask Learning with a Unified Transformer
[30] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[31] Unifying Vision-and-Language Tasks via Text Generation
[32] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
[33] Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
[34] UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
[35] TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
[36] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[37] VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training
[38] Learning Visual Representations with Caption Annotations
[39] ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
[40] VirTex: Learning Visual Representations from Textual Annotations
[41] Large-Scale Adversarial Training for Vision-and-Language Representation Learning
[42] End-to-End Object Detection with Transformers
[43] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
[44] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
[45] Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection
[46] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[47] Objects365: A Large-Scale, High-Quality Dataset for Object Detection
[48] Randaugment: Practical automated data augmentation with a reduced search space
[49] UNITER: UNiversal Image-TExt Representation Learning
[50] Unified Vision-Language Pre-Training for Image Captioning and VQA
[51] VL-BERT: Pre-training of Generic Visual-Linguistic Representations
[52] LXMERT: Learning Cross-Modality Encoder Representations from Transformers
[53] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
[54] VisualBERT: A Simple and Performant Baseline for Vision and Language
[55] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
[56] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[57] LVIS: A Dataset for Large Vocabulary Instance Segmentation
[58] A Corpus for Reasoning about Natural Language Grounded in Photographs
[59] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[60] GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
[61] MAttNet: Modular Attention Network for Referring Expression Comprehension
[62] Decoupled Weight Decay Regularization
[63] Focal Loss for Dense Object Detection
[64] Attention is All you Need
[66] Feature Pyramid Networks for Object Detection
[67] Self-Critical Sequence Training for Image Captioning
[68] Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
[69] Modeling Context in Referring Expressions
[70] SPICE: Semantic Propositional Image Caption Evaluation
[71] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
[72] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
[73] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
[74] VQA: Visual Question Answering
[75] Deep visual-semantic alignments for generating image descriptions
[76] CIDEr: Consensus-based image description evaluation
[77] ReferItGame: Referring to Objects in Photographs of Natural Scenes
[78] Sequence to Sequence Learning with Neural Networks
[79] Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
[80] Microsoft COCO: Common Objects in Context
[82] Im2Text: Describing Images Using 1 Million Captioned Photographs
[83] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
[84] Bleu: a Method for Automatic Evaluation of Machine Translation
[85] Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
[86] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[87] Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
[88] Text Generation by Learning from Demonstrations
[89] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[90] nocaps: novel object captioning at scale
[99] Towards General Purpose Vision Systems