[2] CoCa: Contrastive Captioners are Image-Text Foundation Models
[3] Flamingo: a Visual Language Model for Few-Shot Learning
[4] Unified Contrastive Learning in Image-Text-Label Space
[5] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[6] SLIP: Self-supervision meets Language-Image Pre-training
[7] Injecting Semantic Concepts into End-to-End Image Captioning
[8] FLAVA: A Foundational Language And Vision Alignment Model
[9] Grounded Language-Image Pre-training
[10] Scaling Up Vision-Language Pretraining for Image Captioning
[11] UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
[12] Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
[13] LiT: Zero-Shot Transfer with Locked-image text Tuning
[14] FILIP: Fine-grained Interactive Language-Image Pre-Training
[15] An Empirical Study of Training End-to-End Vision-and-Language Transformers
[16] VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
[17] Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
[18] Pix2seq: A Language Modeling Framework for Object Detection
[19] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
[20] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
[21] How Much Can CLIP Benefit Vision-and-Language Tasks?
[22] Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
[23] Dynamic Head: Unifying Object Detection Heads with Attentions
[24] VinVL: Revisiting Visual Representations in Vision-Language Models
[25] MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
[26] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
[27] Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
[28] Learning Transferable Visual Models From Natural Language Supervision
[29] UniT: Multimodal Multitask Learning with a Unified Transformer
[30] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[31] Unifying Vision-and-Language Tasks via Text Generation
[32] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
[33] Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
[34] UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
[35] TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
[36] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[37] VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training
[38] Learning Visual Representations with Caption Annotations
[39] ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
[40] VirTex: Learning Visual Representations from Textual Annotations
[41] Large-Scale Adversarial Training for Vision-and-Language Representation Learning
[42] End-to-End Object Detection with Transformers
[43] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
[44] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
[45] Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection
[46] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[47] Objects365: A Large-Scale, High-Quality Dataset for Object Detection
[48] Randaugment: Practical automated data augmentation with a reduced search space
[49] UNITER: UNiversal Image-TExt Representation Learning
[50] Unified Vision-Language Pre-Training for Image Captioning and VQA
[51] VL-BERT: Pre-training of Generic Visual-Linguistic Representations
[52] LXMERT: Learning Cross-Modality Encoder Representations from Transformers
[53] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
[54] VisualBERT: A Simple and Performant Baseline for Vision and Language
[55] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
[56] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[57] LVIS: A Dataset for Large Vocabulary Instance Segmentation
[58] A Corpus for Reasoning about Natural Language Grounded in Photographs
[59] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[60] GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
[61] MAttNet: Modular Attention Network for Referring Expression Comprehension
[62] Decoupled Weight Decay Regularization
[63] Focal Loss for Dense Object Detection
[64] Attention is All you Need
[66] Feature Pyramid Networks for Object Detection
[67] Self-Critical Sequence Training for Image Captioning
[68] Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
[69] Modeling Context in Referring Expressions
[70] SPICE: Semantic Propositional Image Caption Evaluation
[71] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
[72] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
[73] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
[74] VQA: Visual Question Answering
[75] Deep visual-semantic alignments for generating image descriptions
[76] CIDEr: Consensus-based image description evaluation
[77] ReferItGame: Referring to Objects in Photographs of Natural Scenes
[78] Sequence to Sequence Learning with Neural Networks
[79] Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
[80] Microsoft COCO: Common Objects in Context
[82] Im2Text: Describing Images Using 1 Million Captioned Photographs
[83] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
[84] Bleu: a Method for Automatic Evaluation of Machine Translation
[85] Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
[86] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[87] Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
[88] Text Generation by Learning from Demonstrations
[89] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[90] nocaps: novel object captioning at scale
[99] Towards General Purpose Vision Systems