1. Baichuan 2: Open Large-scale Language Models
2. MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
3. MLLM-DataEngine: An Iterative Refinement Approach for MLLM
4. VIGC: Visual Instruction Generation and Correction
5. WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
6. BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
7. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
8. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
9. Llama 2: Open Foundation and Fine-Tuned Chat Models
10. MMBench: Is Your Multi-modal Model an All-around Player?
11. What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
12. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
13. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
14. Kosmos-2: Grounding Multimodal Large Language Models to the World
15. A Survey on Multimodal Large Language Models
16. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
17. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
18. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
19. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
20. Otter: A Multi-Modal Model with In-Context Instruction Tuning
21. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
22. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
23. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
24. Visual Instruction Tuning
25. Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text
27. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
28. PaLM-E: An Embodied Multimodal Language Model
29. LLaMA: Open and Efficient Foundation Language Models
30. Grounding Language Models to Images for Multimodal Inputs and Outputs
31. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
32. REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
33. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
34. Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
35. LAION-5B: An Open Large-scale Dataset for Training Next Generation Image-Text Models
36. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
37. GLIPv2: Unifying Localization and Vision-Language Understanding
38. A-OKVQA: A Benchmark for Visual Question Answering Using World Knowledge
39. GIT: A Generative Image-to-text Transformer for Vision and Language
40. OPT: Open Pre-trained Transformer Language Models
41. Visual Spatial Reasoning
42. PaLM: Scaling Language Modeling with Pathways
43. Training Language Models to Follow Instructions with Human Feedback
44. Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
45. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
46. Grounded Language-Image Pre-training
47. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
48. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
49. LoRA: Low-Rank Adaptation of Large Language Models
50. GLM: General Language Model Pretraining with Autoregressive Blank Infilling
51. Learning Transferable Visual Models From Natural Language Supervision
52. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts
53. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision
54. Language Models are Few-Shot Learners
55. TextCaps: A Dataset for Image Captioning with Reading Comprehension
56. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
57. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
58. OCR-VQA: Visual Question Answering by Reading Text in Images
59. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
60. Towards VQA Models That Can Read
61. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
62. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning
63. Attention Is All You Need
65. VQA: Visual Question Answering
66. Microsoft COCO Captions: Data Collection and Evaluation Server
67. Im2Text: Describing Images Using 1 Million Captioned Photographs
68. ImageNet: A Large-scale Hierarchical Image Database
69. Aligning Large Multi-Modal Model with Robust Instruction Tuning
70. Retrieval-Augmented Multimodal Language Modeling
71. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
72. Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
73. TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training
74. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
75. Improving Language Understanding by Generative Pre-Training
76. Stanford Alpaca: An Instruction-Following LLaMA Model
77. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities
78. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
80. Timothée Lacroix, Baptiste
81. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
82. Introducing Qwen-7B: Open Foundation and Human-Aligned Models (of the State-of-the-Arts)