3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in Visual Instruction Following.
Use these libraries to find Visual Instruction Following models and implementations.
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
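A minimal sketch of the BLIP-2 idea, learned query tokens that cross-attend to frozen image features and are then projected into a frozen LLM's embedding space. The dimensions, module names, and the single cross-attention layer below are illustrative assumptions, not the paper's actual Q-Former.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Illustrative stand-in for BLIP-2's Q-Former: a fixed set of learnable
    query tokens cross-attends to frozen image features, and the result is
    linearly projected into the (frozen) LLM embedding space."""

    def __init__(self, num_queries=32, vision_dim=1024, hidden_dim=768, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)   # map image features to Q-Former width
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)           # project query outputs into LLM space

    def forward(self, image_feats):                  # image_feats: (B, num_patches, vision_dim)
        kv = self.vision_proj(image_feats)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)          # queries attend to frozen image features
        return self.to_llm(out)                      # (B, num_queries, llm_dim) soft visual prompts

# Only the Q-Former-like module is trained; the vision encoder and LLM stay frozen.
image_feats = torch.randn(2, 257, 1024)              # dummy patch features from a frozen ViT
visual_prompts = TinyQFormer()(image_feats)
print(visual_prompts.shape)                          # torch.Size([2, 32, 2560])
```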
This paper presents LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, introduces GPT-4-generated visual instruction-tuning data, and makes the model and code base publicly available.
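A minimal sketch of the LLaVA-style connection: visual patch features are projected into the LLM's token-embedding space and prepended to the embedded instruction tokens. The dimensions and helper name here are assumptions for illustration, not LLaVA's actual code.

```python
import torch
import torch.nn as nn

# Assumed dimensions: 1024-d vision-encoder patch features into a 4096-d LLM embedding space.
VISION_DIM, LLM_DIM = 1024, 4096

# LLaVA's original connector is a single linear projection of the visual tokens.
projector = nn.Linear(VISION_DIM, LLM_DIM)

def build_multimodal_input(image_feats, instruction_embeds):
    """Project visual patch features and prepend them to the embedded
    instruction tokens, forming one input sequence for the LLM."""
    visual_tokens = projector(image_feats)                     # (B, P, LLM_DIM)
    return torch.cat([visual_tokens, instruction_embeds], dim=1)

image_feats = torch.randn(1, 576, VISION_DIM)       # dummy patch features from a frozen vision encoder
instruction_embeds = torch.randn(1, 32, LLM_DIM)    # dummy embedded instruction tokens
inputs_embeds = build_multimodal_input(image_feats, instruction_embeds)
print(inputs_embeds.shape)                          # torch.Size([1, 608, 4096])
```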
This paper presents the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework, and shows that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient.
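The fully-connected connector studied here can be pictured as a small MLP in place of the single linear projection sketched above; LLaVA-1.5 reportedly uses a two-layer MLP with GELU, but treat the widths below as illustrative assumptions.

```python
import torch.nn as nn

# Illustrative two-layer MLP connector: vision feature width -> LLM embedding width.
def make_mlp_connector(vision_dim=1024, llm_dim=4096):
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

connector = make_mlp_connector()   # drop-in replacement for the single linear projection above
```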
This paper conducts a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models, and introduces an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction.
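One way to picture an "instruction-aware" query transformer is to let the instruction token embeddings join the learned queries before attending to the image, so the extracted features are conditioned on the instruction. The sketch below extends the TinyQFormer idea from earlier and is an assumption about the mechanism's shape, not InstructBLIP's actual implementation.

```python
import torch
import torch.nn as nn

class InstructionAwareQueries(nn.Module):
    """Sketch: instruction token embeddings are concatenated with the learned
    query tokens, so the queries that attend to the image are conditioned on
    the given instruction (simplified to one self- and one cross-attention layer)."""

    def __init__(self, num_queries=32, hidden_dim=768, vision_dim=1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.num_queries = num_queries

    def forward(self, image_feats, instruction_embeds):
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        mixed = torch.cat([q, instruction_embeds], dim=1)       # queries see the instruction...
        mixed, _ = self.self_attn(mixed, mixed, mixed)
        q = mixed[:, : self.num_queries]                        # ...then only the queries are kept
        kv = self.vision_proj(image_feats)
        out, _ = self.cross_attn(q, kv, kv)                     # attend to the image features
        return out                                              # (B, num_queries, hidden_dim)

feats = torch.randn(2, 257, 1024)                               # dummy frozen image features
instr = torch.randn(2, 16, 768)                                 # dummy embedded instruction tokens
print(InstructionAwareQueries()(feats, instr).shape)            # torch.Size([2, 32, 768])
```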
This work annotates Instruction Clarification Requests (iCRs) in CoDraw, an existing dataset of interactions in a multimodal collaborative dialogue game, and shows that it contains lexically and semantically diverse iCRs, produced by players who decide on their own to clarify in order to solve the task successfully.
The ShareGPT4V dataset is introduced: a pioneering large-scale resource featuring 1.2 million highly descriptive captions that surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations, and is intended to serve as a pivotal resource for advancing the LMM community.
This work introduces the Chain-of-Spot (CoS) method, an interactive reasoning approach that enhances feature extraction by focusing on the key regions of interest (ROI) in the image that correspond to the posed questions or instructions, thereby offering multi-granularity image features.
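The two-step interactive reasoning described here can be sketched as: first ask the model where to look, crop that region, then answer using both the full image and the zoomed-in crop. The `vlm_generate` callable and the comma-separated ROI format below are hypothetical placeholders, not the paper's interface.

```python
from PIL import Image

def chain_of_spot_answer(image: Image.Image, question: str, vlm_generate) -> str:
    """Hypothetical two-pass inference loop in the spirit of Chain-of-Spot:
    1) ask the model for the region of interest relevant to the question,
    2) re-ask the question with both the global view and the cropped ROI."""
    # Pass 1: request a normalized bounding box for the relevant region.
    roi_text = vlm_generate(
        images=[image],
        prompt=f"{question}\nFirst give the relevant region as x0,y0,x1,y1 in [0,1].",
    )
    x0, y0, x1, y1 = (float(v) for v in roi_text.strip().split(","))
    w, h = image.size
    roi = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

    # Pass 2: answer with multi-granularity input (full image + zoomed-in ROI).
    return vlm_generate(
        images=[image, roi],
        prompt=f"{question}\nUse both the full image and the zoomed-in region.",
    )
```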