[1] STAR: A Benchmark for Situated Reasoning in Real-World Videos
[2] LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
[3] MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
[4] EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
[5] MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
[6] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
[7] Llama 2: Open Foundation and Fine-Tuned Chat Models
[8] InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
[9] MMBench: Is Your Multi-modal Model an All-around Player?
[10] Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
[11] FunQA: Towards Surprising Video Comprehension
[12] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
[13] LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
[14] Valley: Video Assistant with Large Language model Enhanced abilitY
[15] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
[16] M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
[17] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
[18] Paxion: Patching Action Knowledge in Video-Language Foundation Models
[19] Evaluating Object Hallucination in Large Vision-Language Models
[20] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
[21] Self-Chained Image-Language Model for Video Localization and Question Answering
[22] VideoChat: Chat-Centric Video Understanding
[23] MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
[24] A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
[25] Otter: A Multi-Modal Model With In-Context Instruction Tuning
[26] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
[27] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
[28] Visual Instruction Tuning
[29] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
[30] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
[31] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
[32] EVA-CLIP: Improved Training Techniques for CLIP at Scale
[33] PaLM-E: An Embodied Multimodal Language Model
[34] Language Is Not All You Need: Aligning Perception with Language Models
[35] LLaMA: Open and Efficient Foundation Language Models
[36] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
[37] HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
[38] MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
[39] InternVideo: General Video Foundation Models via Generative and Discriminative Learning
[40] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
[41] GLM-130B: An Open Bilingual Pre-trained Model
[42] Video Graph Transformer for Video Question Answering
[43] ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities
[44] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
[45] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
[46] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
[47] Flamingo: a Visual Language Model for Few-Shot Learning
[48] PaLM: Scaling Language Modeling with Pathways
[49] All in One: Exploring Unified Video-Language Pre-Training
[50] Chain of Thought Prompting Elicits Reasoning in Large Language Models
[51] Video as Conditional Graph Hierarchy for Multi-Granular Question Answering
[52] VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
[53] Ego4D: Around the World in 3,000 Hours of Egocentric Video
[54] Finetuned Language Models Are Zero-Shot Learners
[55] LoRA: Low-Rank Adaptation of Large Language Models
[56] NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
[57] Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
[58] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
[59] Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
[60] VisualMRC: Machine Reading Comprehension on Document Images
[61] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
[62] MovieNet: A Holistic Dataset for Movie Understanding
[63] DocVQA: A Dataset for VQA on Document Images
[64] Language Models are Few-Shot Learners
[65] Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
[66] TextCaps: a Dataset for Image Captioning with Reading Comprehension
[67] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[68] CLEVRER: CoLlision Events for Video REpresentation and Reasoning
[69] OCR-VQA: Visual Question Answering by Reading Text in Images
[70] ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
[71] Scene Text Visual Question Answering
[72] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
[73] NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
[74] Towards VQA Models That Can Read
[75] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
[76] TVQA: Localized, Compositional Video Question Answering
[77] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[78] Moments in Time Dataset: One Million Videos for Event Understanding
[79] Video Question Answering via Gradually Refined Attention over Appearance and Motion
[80] The “Something Something” Video Database for Learning and Evaluating Visual Common Sense
[81] TALL: Temporal Activity Localization via Language Query
[82] TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
[83] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
[84] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
[85] A Hierarchical Approach for Generating Descriptive Image Paragraphs
[86] Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
[87] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
[88] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
[89] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
[90] Microsoft COCO: Common Objects in Context
[91] A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
[92] Im2Text: Describing Images Using 1 Million Captioned Photographs
[93] Collecting Highly Parallel Data for Paraphrase Evaluation
[94] ImageNet: A large-scale hierarchical image database
[95] GPT-4V(ision) System Card
[96] Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
[97] Perception Test: A Diagnostic Benchmark for Multimodal Video Models
[98] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[101] InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities
[102] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality