1
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
2
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
3
Monkey: Image Resolution and Text Label are Important Things for Large Multi-Modal Models
4
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
5
CogVLM: Visual Expert for Pretrained Language Models
6
Improved Baselines with Visual Instruction Tuning
8
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
9
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
10
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
11
Kosmos-2: Grounding Multimodal Large Language Models to the World
12
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
13
OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
14
LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
15
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
16
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
17
PandaGPT: One Model To Instruction-Follow Them All
18
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
19
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
20
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
21
Otter: A Multi-Modal Model With In-Context Instruction Tuning
22
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
23
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
24
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
25
Visual Instruction Tuning
26
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
28
LLaMA: Open and Efficient Foundation Language Models
29
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
31
Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks
32
Visual Spatial Reasoning
33
Flamingo: a Visual Language Model for Few-Shot Learning
34
Training language models to follow instructions with human feedback
35
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
36
LoRA: Low-Rank Adaptation of Large Language Models
37
GLM: General Language Model Pretraining with Autoregressive Blank Infilling
38
Language Models are Few-Shot Learners
39
KonIQ-10k: An Ecologically Valid Database for Deep Learning of Blind Image Quality Assessment
40
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
41
Towards VQA Models That Can Read
42
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
43
Places: A 10 Million Image Database for Scene Recognition
44
VizWiz Grand Challenge: Answering Visual Questions from Blind People
45
Dual-Glance Model for Deciphering Social Relationships
46
ShapeWorld - A new test methodology for multimodal language understanding
47
Towards Automatic Learning of Procedures From Web Instructional Videos
49
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
50
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
51
Microsoft COCO Captions: Data Collection and Evaluation Server
52
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
53
Précis of Bayesian Rationality: The Probabilistic Approach to Human Reasoning
54
Vicuna: An Open-source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
55
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
56
nocaps: novel object captioning at scale
57
Language Models are Unsupervised Multitask Learners
59
Gemini: A Family of Highly Capable Multimodal Models
61
MiniCPM: Unveiling the Potential of End-side Large Language Models
62
InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities
66
OpenCompass: A Universal Evaluation Platform for Foundation Models
67
OmniLMM: Large Multi-modal Models for Strong Performance and Efficient Deployment