1
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
2
Unified Contrastive Learning in Image-Text-Label Space
3
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
4
Pathways: Asynchronous Distributed Dataflow for ML
5
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
6
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
7
Masked Feature Prediction for Self-Supervised Visual Pre-Training
8
Co-training Transformer with Videos and Images Improves Action Recognition
9
FLAVA: A Foundational Language And Vision Alignment Model
10
Scaling Up Vision-Language Pretraining for Image Captioning
11
Florence: A New Foundation Model for Computer Vision
12
SimMIM: A Simple Framework for Masked Image Modeling
13
LiT: Zero-Shot Transfer with Locked-image text Tuning
14
Masked Autoencoders Are Scalable Vision Learners
15
FILIP: Fine-grained Interactive Language-Image Pre-Training
16
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
17
An Empirical Study of Training End-to-End Vision-and-Language Transformers
18
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
19
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
20
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
21
On the Opportunities and Risks of Foundation Models
22
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
23
How Much Can CLIP Benefit Vision-and-Language Tasks?
24
The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning
25
Multimodal Few-Shot Learning with Frozen Language Models
26
BEiT: BERT Pre-Training of Image Transformers
27
CoAtNet: Marrying Convolution and Attention for All Data Sizes
28
Scaling Vision Transformers
29
VinVL: Revisiting Visual Representations in Vision-Language Models
30
GSPMD: General and Scalable Parallelization for ML Computation Graphs
31
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
32
ViViT: A Video Vision Transformer
33
MoViNets: Mobile Video Networks for Efficient Video Recognition
34
Learning Transferable Visual Models From Natural Language Supervision
35
A Straightforward Framework For Video Retrieval Using CLIP
36
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
37
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
38
Unifying Vision-and-Language Tasks via Text Generation
39
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
40
A Short Note on the Kinetics-700-2020 Human Action Dataset
41
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
42
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
43
Language Models are Few-Shot Learners
44
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
46
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
47
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
48
UNITER: UNiversal Image-TExt Representation Learning
49
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
50
Natural Adversarial Examples
51
A Short Note on the Kinetics-700 Human Action Dataset
52
Learning Robust Global Representations by Penalizing Local Predictive Power
53
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
54
Do ImageNet Classifiers Generalize to ImageNet?
55
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
56
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
57
A Corpus for Reasoning about Natural Language Grounded in Photographs
58
A Short Note about Kinetics-600
59
Exploring the Limits of Weakly Supervised Pretraining
60
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
61
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
62
Moments in Time Dataset: One Million Videos for Event Understanding
63
Decoupled Weight Decay Regularization
64
Attention Is All You Need
65
The Kinetics Human Action Video Dataset
66
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
67
Self-Critical Sequence Training for Image Captioning
68
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
69
Deep Residual Learning for Image Recognition
70
Neural Machine Translation of Rare Words with Subword Units
71
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
72
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
73
Microsoft COCO Captions: Data Collection and Evaluation Server
74
Show and tell: A neural image caption generator
75
Fully convolutional networks for semantic segmentation
76
Two-Stream Convolutional Networks for Action Recognition in Videos
77
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
78
ImageNet classification with deep convolutional neural networks
79
ImageNet: A large-scale hierarchical image database
80
A Learning Algorithm for Continually Running Fully Recurrent Neural Networks
81
Combined Scaling for Open-Vocabulary Image Classification
82
Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
85
nocaps: novel object captioning at scale
86
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
87
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
88
Set transformer: A framework for attention-based permutation-invariant neural networks