On vision-language understanding (VLU) tasks, fusion-encoder vision-language models achieve superior results but sacrifice efficiency because they encode images and text jointly. In contrast, dual-encoder models that encode images and text separately are far more efficient, yet fall short on VLU tasks because they lack deep cross-modal interactions. To get the best of both worlds, we propose DiDE, a framework that distills the knowledge of a fusion-encoder teacher model into a dual-encoder student model. Since cross-modal interaction is the key to the teacher's superior performance but is absent in the student, we encourage the student not only to mimic the teacher's predictions but also to compute cross-modal attention distributions and align them with the teacher's. Experimental results demonstrate that DiDE is competitive with the fusion-encoder teacher in performance (only a 1% drop) while enjoying 4 times faster inference. Further analyses reveal that the proposed cross-modal attention distillation is crucial to the success of our framework.
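To make the two distillation signals concrete, here is a minimal PyTorch sketch. It assumes standard scaled dot-product attention and KL-divergence objectives; the function names (cross_modal_attention, dide_loss), the loss weighting (alpha), and the temperature (temp) are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(q, k):
    """Attention distribution from one modality's queries to the other
    modality's keys, e.g. text queries attending over image keys."""
    # q: (batch, n_q, d), k: (batch, n_k, d)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1)  # (batch, n_q, n_k)

def dide_loss(student_logits, teacher_logits,
              student_attn, teacher_attn, alpha=1.0, temp=1.0):
    """Prediction distillation plus cross-modal attention distillation
    (hypothetical weighting; the paper's exact objective may differ)."""
    # Soft-label distillation: match the teacher's task predictions.
    pred_loss = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean",
    ) * temp ** 2
    # Attention distillation: align the student's cross-modal attention
    # distribution with the teacher's.
    attn_loss = F.kl_div(
        torch.log(student_attn + 1e-8),
        teacher_attn,
        reduction="batchmean",
    )
    return pred_loss + alpha * attn_loss

# Toy shapes: 8 examples, 16 text tokens, 49 image patches, 64-dim heads.
student_attn = cross_modal_attention(torch.randn(8, 16, 64), torch.randn(8, 49, 64))
teacher_attn = cross_modal_attention(torch.randn(8, 16, 64), torch.randn(8, 49, 64))
loss = dide_loss(torch.randn(8, 3), torch.randn(8, 3), student_attn, teacher_attn)
```

Note that the student computes these cross-modal attention distributions only as a training-time target; at inference the two encoders remain fully decoupled, which is what preserves the dual-encoder speed advantage.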
Furu Wei, Wenhui Wang, Haichao Zhu