[1] Larger language models do in-context learning differently
[2] PaLM-E: An Embodied Multimodal Language Model
[3] Language Is Not All You Need: Aligning Perception with Language Models
[4] Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
[5] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
[6] Scaling Vision Transformers to 22 Billion Parameters
[7] DePlot: One-shot visual language reasoning by plot-to-table translation
[8] Structured Prompting: Scaling In-Context Learning to 1,000 Examples
[9] VindLU: A Recipe for Effective Video-and-Language Pretraining
[10] Unifying Vision, Text, and Layout for Universal Document Processing
[11] Underspecification in Scene Description-to-Depiction Tasks
[12] Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
[13] Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
[14] PaLI: A Jointly-Scaled Multilingual Language-Image Model
[15] PreSTU: Pre-Training for Scene-Text Understanding
[16] Pre-training image-language transformers for open-vocabulary tasks
[17] Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
[18] Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
[19] GIT: A Generative Image-to-text Transformer for Vision and Language
[20] Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset
[21] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
[22] Simple Open-Vocabulary Object Detection with Vision Transformers
[23] UL2: Unifying Language Learning Paradigms
[24] All You May Need for VQA are Image Captions
[25] Flamingo: a Visual Language Model for Few-Shot Learning
[26] PaLM: Scaling Language Modeling with Pathways
[27] ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
[28] A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
[29] Fairness Indicators for Systematic Assessments of Visual Feature Extractors
[30] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
[31] End-to-end Generative Pretraining for Multimodal Video Captioning
[32] LaTr: Layout-Aware Transformer for Scene-Text VQA
[33] RegionCLIP: Region-based Language-Image Pretraining
[34] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[35] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[36] Meta-learning via Language Model In-context Tuning
[37] Vector-quantized Image Modeling with Improved VQGAN
[38] Pix2seq: A Language Modeling Framework for Object Detection
[39] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
[40] End-to-End Dense Video Captioning with Parallel Decoding
[41] Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
[42] Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model
[43] NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
[44] Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
[45] A Step Toward More Inclusive People Annotations for Fairness
[46] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
[47] FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation
[48] Uncovering the Bias in Facial Expressions
[49] mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
[50] Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
[51] DocVQA: A Dataset for VQA on Document Images
[52] The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
[53] Are we done with ImageNet?
[54] Language Models are Few-Shot Learners
[55] Revisiting Modulated Convolutions for Visual Counting and Beyond
[56] TextCaps: a Dataset for Image Captioning with Reading Comprehension
[57] Captioning Images Taken by People Who Are Blind
[58] Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation
[59] OCR-VQA: Visual Question Answering by Reading Text in Images
[60] Natural Adversarial Examples
[61] Does Object Recognition Work for Everyone?
[62] ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
[63] Annotating Objects and Relations in User-Generated Videos
[64] LVIS: A Dataset for Large Vocabulary Instance Segmentation
[65] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
[66] Scene Text Visual Question Answering
[67] Learning Robust Global Representations by Penalizing Local Predictive Power
[68] Towards VQA Models That Can Read
[69] VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
[70] Do ImageNet Classifiers Generalize to ImageNet?
[71] TallyQA: Answering Complex Counting Questions
[72] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[73] Gender Bias in Coreference Resolution
[74] Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems
[75] Women also Snowboard: Overcoming Bias in Captioning Models
[76] VizWiz Grand Challenge: Answering Visual Questions from Blind People
[77] Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
[78] Video Question Answering via Gradually Refined Attention over Appearance and Motion
[79] Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
[80] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
[81] Semantics derived automatically from language corpora contain human-like biases
[82] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
[83] A Diagram is Worth a Dozen Images
[84] ActivityNet: A large-scale video benchmark for human activity understanding
[85] Deep visual-semantic alignments for generating image descriptions
[86] Deep Learning Face Attributes in the Wild
[87] A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input
[88] Fairness through awareness
[89] ImageNet: A large-scale hierarchical image database
[90] PaLM 2 Technical Report
[91] nocaps: novel object captioning at scale
[92] ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
[94] The Misgendering Machines: Trans/HCI Implications of Automatic Gender Recognition
[95] Dense-Captioning Events in Videos
[96] A Survey on In-context Learning