1
MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning
2
Improving Compositional Text-to-image Generation with Large Vision-Language Models
3
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning
4
Improved Baselines with Visual Instruction Tuning
5
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
6
DreamLLM: Synergistic Multimodal Comprehension and Creation
7
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
8
ImageBind-LLM: Multi-modality Instruction Tuning
9
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
10
PointLLM: Empowering Large Language Models to Understand Point Clouds
11
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
12
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
13
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
14
Llama 2: Open Foundation and Fine-Tuned Chat Models
15
MMBench: Is Your Multi-modal Model an All-around Player?
16
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
17
Kosmos-2: Grounding Multimodal Large Language Models to the World
18
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
19
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
20
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
21
PandaGPT: One Model To Instruction-Follow Them All
22
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
23
Evaluating Object Hallucination in Large Vision-Language Models
24
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
25
ImageBind One Embedding Space to Bind Them All
26
Otter: A Multi-Modal Model With In-Context Instruction Tuning
27
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
28
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
29
Visual Instruction Tuning
30
DINOv2: Learning Robust Visual Features without Supervision
31
Inpaint Anything: Segment Anything Meets Image Inpainting
32
Instruction Tuning with GPT-4
34
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
35
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
37
Universal Instance Perception as Object Discovery and Retrieval
38
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
39
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
40
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
41
Language Is Not All You Need: Aligning Perception with Language Models
42
LLaMA: Open and Efficient Foundation Language Models
43
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
44
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
45
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
46
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
47
OPT: Open Pre-trained Transformer Language Models
48
Visual Spatial Reasoning
49
Flamingo: a Visual Language Model for Few-Shot Learning
50
Training language models to follow instructions with human feedback
51
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
52
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
53
High-Resolution Image Synthesis with Latent Diffusion Models
54
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
55
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
56
Resolution-robust Large Mask Inpainting with Fourier Convolutions
57
Learning Transferable Visual Models From Natural Language Supervision
58
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
59
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
60
Language Models are Few-Shot Learners
61
TextCaps: a Dataset for Image Captioning with Reading Comprehension
62
OCR-VQA: Visual Question Answering by Reading Text in Images
63
LVIS: A Dataset for Large Vocabulary Instance Segmentation
64
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
65
Towards VQA Models That Can Read
66
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
67
VizWiz Grand Challenge: Answering Visual Questions from Blind People
68
Attention is All you Need
69
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
70
Deep Residual Learning for Image Recognition
71
Generation and Comprehension of Unambiguous Object Descriptions
72
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
73
VQA: Visual Question Answering
74
ReferItGame: Referring to Objects in Photographs of Natural Scenes
75
ImageNet Large Scale Visual Recognition Challenge
76
Microsoft COCO: Common Objects in Context
77
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
78
3D-LLM: Injecting the 3D World into Large Language Models
79
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
80
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
81
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
82
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
83
Language Models are Unsupervised Multitask Learners
84
Improving Language Understanding by Generative Pre-Training
85
Adam: A Method for Stochastic Optimization
87
Introducing our multimodal models
88
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
89
Stanford Alpaca: An Instruction-Following LLaMA Model
91
OpenCompass: A Universal Evaluation Platform for Foundation Models