1. UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
2. InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
3. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
4. NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition
5. TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
6. X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
7. 1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022)
8. LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
9. Masked Autoencoders As Spatiotemporal Learners
10. CoCa: Contrastive Captioners are Image-Text Foundation Models
11. Flamingo: a Visual Language Model for Few-Shot Learning
12. Unified Contrastive Learning in Image-Text-Label Space
13. MultiMAE: Multi-modal Multi-task Masked Autoencoders
14. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
15. Pathways: Asynchronous Distributed Dataflow for ML
16. All in One: Exploring Unified Video-Language Pre-Training
17. Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
18. ActionFormer: Localizing Moments of Actions with Transformers
19. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
20. UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
21. Multiview Transformers for Video Recognition
22. MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
23. Masked Feature Prediction for Self-Supervised Visual Pre-Training
24. BEVT: BERT Pretraining of Video Transformers
25. Scaling Up Vision-Language Pretraining for Image Captioning
26. VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
27. Florence: A New Foundation Model for Computer Vision
28. INTERN: A New Learning Paradigm Towards General Vision
29. Masked Autoencoders Are Scalable Vision Learners
30. FILIP: Fine-grained Interactive Language-Image Pre-Training
31. An Empirical Study of Training End-to-End Vision-and-Language Transformers
32. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
33. History Aware Multimodal Transformer for Vision-and-Language Navigation
34. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
35. ActionCLIP: A New Paradigm for Video Action Recognition
36. Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
37. Robust fine-tuning of zero-shot models
38. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
39. On the Opportunities and Risks of Foundation Models
40. EAN: Event Adaptive Network for Enhanced Action Recognition
41. Evidential Deep Learning for Open Set Action Recognition
42. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
43. How Much Can CLIP Benefit Vision-and-Language Tasks?
45. BEiT: BERT Pre-Training of Image Transformers
46. Relation Modeling in Spatio-Temporal Action Localization
47. MERLOT: Multimodal Neural Script Knowledge Models
48. FineAction: A Fine-Grained Video Dataset for Temporal Action Localization
49. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
50. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
51. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
52. ViViT: A Video Vision Transformer
53. Temporal Context Aggregation Network for Temporal Action Proposal Refinement
54. Learning Transferable Visual Models From Natural Language Supervision
55. Zero-Shot Text-to-Image Generation
56. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
57. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
58. TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
59. Exploring Simple Siamese Representation Learning
60. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
61. Learning Open Set Network with Discriminative Reciprocal Points
62. Generative Pretraining From Pixels
63. Learn to cycle: Time-consistent feature discovery for action recognition
64. Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
65. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
66. ActBERT: Learning Global-Local Video-Text Representations
67. Augment Your Batch: Improving Generalization Through Instance Repetition
68. The AVA-Kinetics Localized Human Actions Video Dataset
69. Asynchronous Interaction Aggregation for Action Detection
70. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
71. A Simple Framework for Contrastive Learning of Visual Representations
72. Learning Spatiotemporal Features via Video and Text Pair Discrimination
73. Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors
74. End-to-End Learning of Visual Representations From Uncurated Instructional Videos
75. Momentum Contrast for Unsupervised Visual Representation Learning
76. Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding
77. BMN: Boundary-Matching Network for Temporal Action Proposal Generation
78. A Short Note on the Kinetics-700 Human Action Dataset
79. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
80. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
81. VideoBERT: A Joint Model for Video and Language Representation Learning
82. SlowFast Networks for Video Recognition
83. BAR: Bayesian Activity Recognition using variational inference
84. A Joint Sequence Fusion Model for Video Question Answering and Retrieval
85. A Short Note about Kinetics-600
86. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
87. Unsupervised Feature Learning via Non-parametric Instance Discrimination
88. HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
89. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
90. Localizing Moments in Video with Natural Language
91. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense
92. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
93. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
95. Deep Learning for Video Classification and Captioning
96. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
97. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
98. TGIF: A New Dataset and Benchmark on Animated GIF Description
99. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
100. Colorful Image Colorization
101. Towards Open Set Deep Networks
102. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding
103. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
104. Unsupervised Visual Representation Learning by Context Prediction
105. A Dataset for Movie Description
106. Microsoft COCO: Common Objects in Context
107. HMDB: A Large Video Database for Human Motion Recognition
108. Collecting Highly Parallel Data for Paraphrase Evaluation
109. Unsupervised Learning of Visual Representations using Videos
110. Computer Vision and Image Understanding