3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in Visual Instruction Following.
Use these libraries to find Visual Instruction Following models and implementations.
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
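A minimal sketch of the BLIP-2 idea, learned query tokens that cross-attend to frozen image features and are then projected into a frozen LLM's embedding space. The dimensions, module names, and the single cross-attention layer below are illustrative assumptions, not the paper's actual Q-Former.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Illustrative stand-in for BLIP-2's Q-Former: a fixed set of learnable
    query tokens cross-attends to frozen image features, and the result is
    linearly projected into the (frozen) LLM embedding space."""

    def __init__(self, num_queries=32, vision_dim=1024, hidden_dim=768, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)   # map image features to Q-Former width
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)           # project query outputs into LLM space

    def forward(self, image_feats):                  # image_feats: (B, num_patches, vision_dim)
        kv = self.vision_proj(image_feats)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)          # queries attend to frozen image features
        return self.to_llm(out)                      # (B, num_queries, llm_dim) soft visual prompts

# Only the Q-Former-like module is trained; the vision encoder and LLM stay frozen.
image_feats = torch.randn(2, 257, 1024)              # dummy patch features from a frozen ViT
visual_prompts = TinyQFormer()(image_feats)
print(visual_prompts.shape)                          # torch.Size([2, 32, 2560])
```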
This paper presents LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, introduces GPT-4-generated visual instruction-tuning data, and makes the model and code base publicly available.
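A minimal sketch of the LLaVA-style connection: visual patch features are projected into the LLM's token-embedding space and prepended to the embedded instruction tokens. The dimensions and helper name here are assumptions for illustration, not LLaVA's actual code.

```python
import torch
import torch.nn as nn

# Assumed dimensions: 1024-d vision-encoder patch features into a 4096-d LLM embedding space.
VISION_DIM, LLM_DIM = 1024, 4096

# LLaVA's original connector is a single linear projection of the visual tokens.
projector = nn.Linear(VISION_DIM, LLM_DIM)

def build_multimodal_input(image_feats, instruction_embeds):
    """Project visual patch features and prepend them to the embedded
    instruction tokens, forming one input sequence for the LLM."""
    visual_tokens = projector(image_feats)                     # (B, P, LLM_DIM)
    return torch.cat([visual_tokens, instruction_embeds], dim=1)

image_feats = torch.randn(1, 576, VISION_DIM)       # dummy patch features from a frozen vision encoder
instruction_embeds = torch.randn(1, 32, LLM_DIM)    # dummy embedded instruction tokens
inputs_embeds = build_multimodal_input(image_feats, instruction_embeds)
print(inputs_embeds.shape)                          # torch.Size([1, 608, 4096])
```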
This paper presents the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework, and shows that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient.
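The fully-connected connector studied here can be pictured as a small MLP in place of the single linear projection sketched above; LLaVA-1.5 reportedly uses a two-layer MLP with GELU, but treat the widths below as illustrative assumptions.

```python
import torch.nn as nn

# Illustrative two-layer MLP connector: vision feature width -> LLM embedding width.
def make_mlp_connector(vision_dim=1024, llm_dim=4096):
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

connector = make_mlp_connector()   # drop-in replacement for the single linear projection above
```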
This paper conducts a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models, and introduces an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction.
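One way to picture an "instruction-aware" query transformer is to let the instruction token embeddings join the learned queries before attending to the image, so the extracted features are conditioned on the instruction. The sketch below extends the TinyQFormer idea from earlier and is an assumption about the mechanism's shape, not InstructBLIP's actual implementation.

```python
import torch
import torch.nn as nn

class InstructionAwareQueries(nn.Module):
    """Sketch: instruction token embeddings are concatenated with the learned
    query tokens, so the queries that attend to the image are conditioned on
    the given instruction (simplified to one self- and one cross-attention layer)."""

    def __init__(self, num_queries=32, hidden_dim=768, vision_dim=1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.num_queries = num_queries

    def forward(self, image_feats, instruction_embeds):
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        mixed = torch.cat([q, instruction_embeds], dim=1)       # queries see the instruction...
        mixed, _ = self.self_attn(mixed, mixed, mixed)
        q = mixed[:, : self.num_queries]                        # ...then only the queries are kept
        kv = self.vision_proj(image_feats)
        out, _ = self.cross_attn(q, kv, kv)                     # attend to the image features
        return out                                              # (B, num_queries, hidden_dim)

feats = torch.randn(2, 257, 1024)                               # dummy frozen image features
instr = torch.randn(2, 16, 768)                                 # dummy embedded instruction tokens
print(InstructionAwareQueries()(feats, instr).shape)            # torch.Size([2, 32, 768])
```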
This work annotates Instruction Clarification Requests (iCRs) in CoDraw, an existing dataset of interactions in a multimodal collaborative dialogue game, and shows that it contains lexically and semantically diverse iCRs, produced by players who decide on their own to clarify in order to solve the task successfully.
The ShareGPT4V dataset is introduced: a pioneering large-scale resource featuring 1.2 million highly descriptive captions that surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations, and is intended to serve as a pivotal resource for advancing the LMM community.
This work introduces the Chain-of-Spot (CoS) method, an interactive reasoning approach that enhances feature extraction by focusing on the key regions of interest (ROI) in the image that correspond to the posed questions or instructions, thereby offering multi-granularity image features.
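The two-step interactive reasoning described here can be sketched as: first ask the model where to look, crop that region, then answer using both the full image and the zoomed-in crop. The `vlm_generate` callable and the comma-separated ROI format below are hypothetical placeholders, not the paper's interface.

```python
from PIL import Image

def chain_of_spot_answer(image: Image.Image, question: str, vlm_generate) -> str:
    """Hypothetical two-pass inference loop in the spirit of Chain-of-Spot:
    1) ask the model for the region of interest relevant to the question,
    2) re-ask the question with both the global view and the cropped ROI."""
    # Pass 1: request a normalized bounding box for the relevant region.
    roi_text = vlm_generate(
        images=[image],
        prompt=f"{question}\nFirst give the relevant region as x0,y0,x1,y1 in [0,1].",
    )
    x0, y0, x1, y1 = (float(v) for v in roi_text.strip().split(","))
    w, h = image.size
    roi = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

    # Pass 2: answer with multi-granularity input (full image + zoomed-in ROI).
    return vlm_generate(
        images=[image, roi],
        prompt=f"{question}\nUse both the full image and the zoomed-in region.",
    )
```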