1. Meta-Transformer: A Unified Framework for Multimodal Learning
2. VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
3. TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
4. ImageBind: One Embedding Space to Bind Them All
5. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
6. VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
7. Unmasked Teacher: Towards Training-Efficient Video Foundation Models
8. AIM: Adapting Image Models for Efficient Video Action Recognition
9. Edge-Guided Multi-Domain RGB-to-TIR Image Translation for Training Vision Tasks with Challenging Labels
10. InternVideo: General Video Foundation Models via Generative and Discriminative Learning
11. Scaling Language-Image Pre-Training via Masking
12. Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
13. OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
14. Learning Audio-Video Modalities from Image Captions
15. Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth
16. PointCLIP: Point Cloud Understanding by CLIP
17. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
18. Masked Autoencoders Are Scalable Vision Learners
19. Ego4D: Around the World in 3,000 Hours of Egocentric Video
20. LLVIP: A Visible-Infrared Paired Dataset for Low-Light Vision
21. LoRA: Low-Rank Adaptation of Large Language Models
22. CLIP4Clip: An Empirical Study of CLIP for End-to-End Video Clip Retrieval
23. Learning Transferable Visual Models From Natural Language Supervision
24. A Straightforward Framework For Video Retrieval Using CLIP
25. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
26. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
27. Rethinking CNN Models for Audio Classification
28. ActBERT: Learning Global-Local Video-Text Representations
29. VGGSound: A Large-Scale Audio-Visual Dataset
30. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
31. MMAct: A Large-Scale Dataset for Cross-Modal Human Action Understanding
32. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
33. AudioCaps: Generating Captions for Audios in The Wild
34. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
35. Convolutional Neural Networks for Static and Dynamic Breast Infrared Imaging Classification
36. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
37. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
38. Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos
39. Audio-Visual Event Localization in Unconstrained Videos
40. Localizing Moments in Video with Natural Language
41. Attention Is All You Need
42. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
43. The Kinetics Human Action Video Dataset
44. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events
45. YouTube-8M: A Large-Scale Video Classification Benchmark
46. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
47. Estimating Depth From Monocular Images as Classification Using Deep Fully Convolutional Residual Networks
48. Content-Based Video Recommendation System Based on Stylistic Visual Features
49. Deep Residual Learning for Image Recognition
50. ESC: Dataset for Environmental Sound Classification
51. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding
52. Interactive Intrinsic Video Editing
53. Large-Scale Video Classification with Convolutional Neural Networks
54. Two-Stream Convolutional Networks for Action Recognition in Videos
55. Microsoft COCO: Common Objects in Context
56. Freesound Technical Demo
57. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
58. Indoor Segmentation and Support Inference from RGBD Images
59. HMDB: A Large Video Database for Human Motion Recognition
60. Collecting Highly Parallel Data for Paraphrase Evaluation
61. ImageNet: A Large-Scale Hierarchical Image Database
62. Recognizing Human Actions: A Local SVM Approach
63. Simplifying Video Editing Using Metadata
64. Image and Video Search Engine for the World Wide Web
65. Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
66. Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
67. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
70. Free Teledyne FLIR Thermal Dataset for Algorithm Training
For depth, we use the NYU-v2 dataset (Silberman et al., 2012) for validation, with its 654 test samples. Through preprocessing, we constrained the depth images to a maximum depth of 10 meters. Following ImageBind, we undertook a category reorganization process.
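A minimal sketch of this preprocessing step (the function name and the final normalization to [0, 1] are assumptions; only the 10-meter cap comes from the text):

```python
import numpy as np

MAX_DEPTH_M = 10.0  # ceiling stated in the text

def preprocess_depth(depth_m: np.ndarray) -> np.ndarray:
    """Clamp a metric depth map to at most MAX_DEPTH_M meters and rescale."""
    depth = np.nan_to_num(depth_m, nan=0.0)   # treat invalid/missing readings as 0
    depth = np.clip(depth, 0.0, MAX_DEPTH_M)  # constrain to a 10 m maximum depth
    return depth / MAX_DEPTH_M                # normalize to [0, 1] (assumption)
```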
MSR-VTT (Xu et al., 2016) comprises 10K YouTube video clips, paired with 200K captions in total.
We validate the zero-shot classification capability on the ESC-50 dataset (Piczak, 2015), which contains 2,000 test audio clips, each labeled with a single class. For zero-shot retrieval
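A minimal sketch of how bind-style zero-shot classification on ESC-50 typically works: embed each class name as text, embed the audio clip into the same shared space, and pick the most similar class. The encoders and the prompt template here are assumptions, not the paper's exact recipe.

```python
import numpy as np

def zero_shot_classify(audio_emb: np.ndarray, class_names: list[str], encode_text):
    """Return the index of the class most similar to the audio embedding.

    `audio_emb` and the vectors returned by `encode_text` (a hypothetical
    text encoder into the shared space) are assumed to be L2-normalized.
    """
    # Embed each label as a natural-language prompt (template is an assumption).
    text_embs = np.stack([encode_text(f"a sound of {name}") for name in class_names])
    scores = text_embs @ audio_emb  # cosine similarity via dot product
    return int(np.argmax(scores))   # predicted class index
```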
D LICENSE
Unless explicitly noted otherwise, our released datasets are provided to users under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License.