1
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
2
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
3
Tool Learning with Foundation Models
4
Visual Instruction Tuning
5
OpenAGI: When LLM Meets Domain Experts
6
Instruction Tuning with GPT-4
7
A Survey of Large Language Models
8
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
9
Text2Motion: from natural language instructions to feasible plans
10
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
12
ViperGPT: Visual Inference via Python Execution for Reasoning
13
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
14
PaLM-E: An Embodied Multimodal Language Model
15
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
16
LLaMA: Open and Efficient Foundation Language Models
17
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
18
Self-Instruct: Aligning Language Models with Self-Generated Instructions
19
Visual Programming: Compositional visual reasoning without training
20
Scaling Instruction-Finetuned Language Models
21
Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning
22
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
23
Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
24
Inner Monologue: Embodied Reasoning through Planning with Language Models
25
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
26
Flamingo: a Visual Language Model for Few-Shot Learning
28
Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models
29
Training language models to follow instructions with human feedback
30
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
31
High-Resolution Image Synthesis with Latent Diffusion Models
32
VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
33
PointCLIP: Point Cloud Understanding by CLIP
34
ClipCap: CLIP Prefix for Image Captioning
35
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
36
FILM: Following Instructions in Language with Modular Methods
37
Finetuned Language Models Are Zero-Shot Learners
38
Learning to Prompt for Vision-Language Models
39
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
40
LoRA: Low-Rank Adaptation of Large Language Models
41
Compacter: Efficient Low-Rank Hypercomplex Adapter Layers
42
The Power of Scale for Parameter-Efficient Prompt Tuning
43
Learning Transferable Visual Models From Natural Language Supervision
44
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
45
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
46
Making Pre-trained Language Models Better Few-shot Learners
47
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
48
DocVQA: A Dataset for VQA on Document Images
49
Language Models are Few-Shot Learners
50
Understand It in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
51
Meshed-Memory Transformer for Image Captioning
52
Unified Vision-Language Pre-Training for Image Captioning and VQA
53
VisualBERT: A Simple and Performant Baseline for Vision and Language
54
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
55
Parameter-Efficient Transfer Learning for NLP
56
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering
57
A Comprehensive Survey of Deep Learning for Image Captioning
58
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
59
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
60
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
61
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
62
Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge
63
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
64
Microsoft COCO Captions: Data Collection and Evaluation Server
65
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
66
Koala: A Dialogue Model for Academic Research
67
Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data
68
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
69
Prefix-Tuning: Optimizing Continuous Prompts for Generation
70
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
71
Language Models are Unsupervised Multitask Learners
72
Improving Language Understanding by Generative Pre-Training
73
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
74
Using LoRA for Efficient Stable Diffusion Fine-Tuning
75
International Conference on Learning Representations
77
Stanford Alpaca: An Instruction-Following LLaMA Model
78
PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning Methods
79
ShareGPT: Share Your Wildest ChatGPT Conversations with One Click