1
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
2
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
3
Learning Transferable Visual Models From Natural Language Supervision
4
Learning the Best Pooling Strategy for Visual Semantic Embedding
5
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
6
Sharpness-Aware Minimization for Efficiently Improving Generalization
7
Contrastive Learning of Medical Visual Representations from Paired Images and Text
8
Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders
9
Learning Visual Representations with Caption Annotations
10
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
11
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
12
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
13
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
14
VirTex: Learning Visual Representations from Textual Annotations
15
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
16
Rethinking Pre-training and Self-training
17
Prototypical Contrastive Learning of Unsupervised Representations
18
Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO
19
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
20
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
22
A Metric Learning Reality Check
23
Understand in 5 Minutes!? Skim-Reading Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
24
A Simple Framework for Contrastive Learning of Visual Representations
25
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
26
Big Transfer (BiT): General Visual Representation Learning
27
Self-Supervised Learning of Pretext-Invariant Representations
28
Momentum Contrast for Unsupervised Visual Representation Learning
29
Self-Training With Noisy Student Improves ImageNet Classification
30
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
31
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
32
UNITER: UNiversal Image-TExt Representation Learning
33
Visual Semantic Reasoning for Image-Text Matching
34
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
35
RoBERTa: A Robustly Optimized BERT Pretraining Approach
36
Natural Adversarial Examples
37
XLNet: Generalized Autoregressive Pretraining for Language Understanding
38
Fixing the train-test resolution discrepancy
39
Contrastive Multiview Coding
40
Data-Efficient Image Recognition with Contrastive Predictive Coding
41
Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
42
Billion-scale semi-supervised learning for image classification
43
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
44
Graph-RISE: Graph-Regularized Image Semantic Embedding
45
Do ImageNet Classifiers Generalize to ImageNet?
46
Classification is a Strong Baseline for Deep Metric Learning
47
Findings of the Third Shared Task on Multimodal Machine Translation
48
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
49
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search
50
Exploring the Limits of Weakly Supervised Pretraining
51
Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description
52
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
53
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
54
Learning Visual N-Grams from Web Data
55
Dual Attention Networks for Multimodal Reasoning and Matching
56
Multi30K: Multilingual English-German Image Descriptions
57
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
58
Learning Visual Features from Large Weakly Supervised Data
59
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
60
Microsoft COCO Captions: Data Collection and Evaluation Server
61
Deep visual-semantic alignments for generating image descriptions
62
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
63
GloVe: Global Vectors for Word Representation
64
Going deeper with convolutions
65
Food-101 - Mining Discriminative Components with Random Forests
66
SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation
67
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
68
Grounded Compositional Semantics for Finding and Describing Images with Sentences
69
Learning Fine-Grained Image Similarity with Deep Ranking
70
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
71
DeViSE: A Deep Visual-Semantic Embedding Model
72
3D Object Representations for Fine-Grained Categorization
73
Distributed Representations of Words and Phrases and their Compositionality
74
Efficient Estimation of Word Representations in Vector Space
75
ImageNet: A large-scale hierarchical image database
76
Automated Flower Classification over a Large Number of Classes
78
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
79
Language Models are Unsupervised Multitask Learners
80
The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale
81
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision