[1] STAR: A Benchmark for Situated Reasoning in Real-World Videos
[2] Hierarchical Text-Conditional Image Generation with CLIP Latents
[3] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
[4] PaLM: Scaling Language Modeling with Pathways
[5] Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
[6] Transformer Language Models without Positional Encodings Still Learn Positional Information
[7] Training Compute-Optimal Large Language Models
[8] Teaching language models to support answers with verified quotes
[9] All in One: Exploring Unified Video-Language Pre-Training
[10] Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
[11] Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
[12] Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
[13] Red Teaming Language Models with Language Models
[14] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[15] LaMDA: Language Models for Dialog Applications
[16] CM3: A Causal Masked Multimodal Model of the Internet
[17] ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
[18] Multiview Transformers for Video Recognition
[19] MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
[20] KAT: A Knowledge Augmented Transformer for Vision-and-Language
[21] VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
[22] MAGMA - Multimodal Augmentation of Generative Models through Adapter-based Finetuning
[23] FLAVA: A Foundational Language And Vision Alignment Model
[24] Ethical and social risks of harm from Language Models
[25] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[26] Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
[27] Scaling Up Vision-Language Pretraining for Image Captioning
[28] VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
[29] Florence: A New Foundation Model for Computer Vision
[30] Combined Scaling for Zero-shot Transfer Learning
[31] UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
[32] ClipCap: CLIP Prefix for Image Captioning
[33] Achieving Human Parity on Visual Question Answering
[34] LiT: Zero-Shot Transfer with Locked-image text Tuning
[35] FILIP: Fine-grained Interactive Language-Image Pre-Training
[36] VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
[37] LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
[38] Multitask Prompted Training Enables Zero-Shot Task Generalization
[39] Pix2seq: A Language Modeling Framework for Object Detection
[40] Primer: Searching for Efficient Transformers for Language Modeling
[41] An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
[42] MURAL: Multimodal, Multitask Retrieval Across Languages
[43] Finetuned Language Models Are Zero-Shot Learners
[44] Learning to Prompt for Vision-Language Models
[45] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
[46] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
[47] Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs
[48] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
[49] Multimodal Few-Shot Learning with Frozen Language Models
[50] Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model
[51] BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
[52] Understanding and Evaluating Racial Biases in Image Captioning
[53] Scaling Vision Transformers
[54] MERLOT: Multimodal Neural Script Knowledge Models
[55] VinVL: Revisiting Visual Representations in Vision-Language Models
[56] True Few-Shot Learning with Language Models
[57] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
[58] NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
[59] The Power of Scale for Parameter-Efficient Prompt Tuning
[60] ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition
[61] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
[62] Perceiver: General Perception with Iterative Attention
[63] Learning Transferable Visual Models From Natural Language Supervision
[64] VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
[65] Calibrate Before Use: Improving Few-Shot Performance of Language Models
[66] Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
[67] Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
[68] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[69] High-Performance Large-Scale Image Recognition Without Normalization
[70] Unifying Vision-and-Language Tasks via Text Generation
[71] Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
[72] What Makes Good In-Context Examples for GPT-3?
[73] The Pile: An 800GB Dataset of Diverse Text for Language Modeling
[74] A Multimodal Framework for the Detection of Hateful Memes
[75] Enhance Multimodal Transformer With External Label And In-Domain Pretrain: Hateful Meme Challenge Winning Solution
[76] TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
[77] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
[78] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[79] A Short Note on the Kinetics-700-2020 Human Action Dataset
[80] RareAct: A video dataset of unusual interactions
[81] CrossTransformers: spatially-aware few-shot transfer
[82] Self-Supervised MultiModal Versatile Networks
[83] VirTex: Learning Visual Representations from Textual Annotations
[84] Large-Scale Adversarial Training for Vision-and-Language Representation Learning
[85] ActBERT: Learning Global-Local Video-Text Representations
[86] Language Models are Few-Shot Learners
[87] End-to-End Object Detection with Transformers
[88] The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes
[89] Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training
[90] VD-BERT: A Unified Vision and Dialog Transformer with BERT
[91] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
[92] Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?
[93] ReZero is All You Need: Fast Convergence at Large Depth
[94] UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
[95] Scaling Laws for Neural Language Models
[96] Diagnosing Gender Bias in Image Recognition Systems
[97] End-to-End Learning of Visual Representations From Uncurated Instructional Videos
[98] Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
[99] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[100] ZeRO: Memory Optimization Towards Training A Trillion Parameter Models
[101] UNITER: UNiversal Image-TExt Representation Learning
[102] Unified Vision-Language Pre-Training for Image Captioning and VQA
[103] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
[104] VL-BERT: Pre-training of Generic Visual-Linguistic Representations
[105] LXMERT: Learning Cross-Modality Encoder Representations from Transformers
[106] Attention on Attention for Image Captioning
[107] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
[108] Fast and Flexible Multi-Task Classification Using Conditional Neural Adaptive Processes
[109] Fixing the train-test resolution discrepancy
[110] HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
[111] Does Object Recognition Work for Everyone?
[112] Energy and Policy Considerations for Deep Learning in NLP
[113] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
[114] Towards VQA Models That Can Read
[115] VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
[116] VideoBERT: A Joint Model for Video and Language Representation Learning
[117] Doing more with less: meta-reasoning and meta-learning in humans and machines
[118] Parameter-Efficient Transfer Learning for NLP
[119] Fast Context Adaptation via Meta-Learning
[120] Model Cards for Model Reporting
[121] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[122] Meta-Learning Probabilistic Inference for Prediction
[123] Meta-learning with differentiable closed-form solvers
[124] Gender Bias in Coreference Resolution
[125] Women also Snowboard: Overcoming Bias in Captioning Models
[126] Datasheets for datasets
[127] VizWiz Grand Challenge: Answering Visual Questions from Blind People
[128] Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
[129] Universal Language Model Fine-tuning for Text Classification
[130] Video Question Answering via Gradually Refined Attention over Appearance and Motion
[131] Attention is All you Need
[132] Prototypical Networks for Few-shot Learning
[133] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
[134] Towards Automatic Learning of Procedures From Web Instructional Videos
[135] Self-Critical Sequence Training for Image Captioning
[137] Gaussian Error Linear Units (GELUs)
[138] Learning feed-forward one-shot learners
[139] Matching Networks for One Shot Learning
[140] Low-Shot Visual Recognition by Shrinking and Hallucinating Features
[141] Exploring the Limits of Language Modeling
[142] VQA: Visual Question Answering
[143] Microsoft COCO Captions: Data Collection and Evaluation Server
[144] Long-term recurrent convolutional networks for visual recognition and description
[145] Show and tell: A neural image caption generator
[146] ImageNet Large Scale Visual Recognition Challenge
[147] From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
[148] Generating Sequences With Recurrent Neural Networks
[149] Generating Text with Recurrent Neural Networks
[150] One-shot learning of object categories
[151] Long Short-Term Memory
[152] VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training
[153] Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
[154] Prefix-Tuning: Optimizing Continuous Prompts for Generation
[155] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[156] Enhancing textual cues in multi-modal transformers for VQA
[157] Few-shot classification by recycling deep learning
[158] Haiku: Sonnet for JAX, 2020
[159] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[160] Language Models are Unsupervised Multitask Learners
[161] VaTeX video captioning challenge 2020: Multi-view features and hybrid reward strategies for video captioning
[162] JAX: composable transformations of Python+NumPy programs, 2018
[163] Optimization of image description metrics using policy gradient methods
[164] YFCC100M: The new data in multimedia research
[165] One shot learning of simple visual concepts
[166] Recurrent neural network based language model
[167] Categorization and Naming in Children: Problems of Induction
[168] Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem
[169] Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition
[170] Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples
Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] Our data was automatically scraped from millions of webpages.
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
(a) If your work uses existing assets, did you cite the creators? [Yes] We properly cited the prior methods on which our work is based, as well as prior datasets when appropriate.
If you used crowdsourcing or conducted research with human subjects...
Did you mention the license of the assets?
The first icon is provided under license by Flaticon, the second image is provided under license by Unsplash, and the third one is provided under license by Sketchfab.
All visuals are drawn from various sources, including the COCO dataset.
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
(c) Did you include any new assets either in the supplemental material or as a URL?
Model Figures 3, 7, 8, and 9: All images are provided under license by Unsplash.