3260 papers • 126 benchmarks • 313 datasets
This work introduces Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs); it proposes to utilize an additional visual encoder for high-resolution refinement of the visual tokens without increasing the visual token count.
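A minimal PyTorch sketch of this dual-encoder idea follows, assuming a cross-attention refinement step in which the low-resolution tokens act as queries over high-resolution patch features, so the output token count stays fixed; the module name, dimensions, and token counts are illustrative assumptions, not Mini-Gemini's actual implementation.

import torch
import torch.nn as nn

class HighResRefiner(nn.Module):
    # Hypothetical refinement block: the output token count is set by the
    # low-resolution stream, so the language model never sees more visual tokens.
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lowres_tokens, highres_feats):
        # lowres_tokens: (B, N_low, dim)  -- tokens actually passed to the LLM
        # highres_feats: (B, N_high, dim) -- dense features from the extra high-res encoder
        refined, _ = self.cross_attn(query=lowres_tokens, key=highres_feats, value=highres_feats)
        return self.norm(lowres_tokens + refined)  # residual keeps the low-res tokens as the backbone

refiner = HighResRefiner()
low = torch.randn(1, 576, 768)    # e.g. 24x24 tokens from the base visual encoder
high = torch.randn(1, 2304, 768)  # e.g. 48x48 patches from the high-resolution encoder
print(refiner(low, high).shape)   # torch.Size([1, 576, 768]) -- token count unchanged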
This paper proposes ArtGPT-4, a pioneering large vision-language model tailored to address the limitations of existing models in artistic comprehension, and shows that it can render images with an artistic understanding and convey the emotions they inspire, mirroring human interpretation.
This work introduces JourneyDB, a comprehensive dataset for generative images in the context of multi-modal visual understanding, together with an external subset containing outputs from another 22 text-to-image generative models, making JourneyDB a comprehensive benchmark for evaluating the comprehension of generated images.
The resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework and achieves state-of-the-art results at various levels of image comprehension.
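Open-vocabulary segmentation of this kind is commonly realized by scoring region or mask embeddings against text embeddings of arbitrary class names; the sketch below shows that generic pattern with cosine similarity and is not HIPIE's actual code (the function and tensors are stand-ins).

import torch
import torch.nn.functional as F

def classify_masks(mask_embeds, text_embeds, temperature=0.07):
    # mask_embeds: (num_masks, dim)   -- one embedding per predicted region/mask
    # text_embeds: (num_classes, dim) -- embeddings of free-form class names
    mask_embeds = F.normalize(mask_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = mask_embeds @ text_embeds.t() / temperature
    return logits.softmax(dim=-1)   # per-mask distribution over the supplied vocabulary

# Random embeddings stand in for real mask and text encoder outputs.
probs = classify_masks(torch.randn(10, 512), torch.randn(3, 512))
print(probs.shape)  # torch.Size([10, 3])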
Experimental results verify that freezing the Q-Former preserves the image comprehension capability of BLIP-2 while gaining comprehension of the newly introduced point cloud modality and regional objects.
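The freezing strategy described above can be sketched in a few lines of PyTorch; the modules below are placeholders standing in for the pretrained BLIP-2 Q-Former and the newly added point-cloud encoder, not the paper's code.

import torch
import torch.nn as nn

# Placeholder modules: a stand-in Q-Former and a new point-cloud encoder.
qformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=2
)
point_cloud_encoder = nn.Sequential(nn.Linear(3, 768), nn.ReLU(), nn.Linear(768, 768))

# Freeze the Q-Former so its image comprehension capability is preserved.
for param in qformer.parameters():
    param.requires_grad = False

# Only the new modality encoder receives gradient updates.
optimizer = torch.optim.AdamW(point_cloud_encoder.parameters(), lr=1e-4)

points = torch.randn(2, 1024, 3)       # (batch, num_points, xyz)
tokens = point_cloud_encoder(points)   # project points into the Q-Former feature space
fused = qformer(tokens)                # frozen weights still run in the forward pass
loss = fused.mean()                    # dummy objective, just to show gradient flow
loss.backward()
optimizer.step()

# Gradients reach the new encoder but not the frozen Q-Former.
print(any(p.grad is not None for p in point_cloud_encoder.parameters()))  # True
print(any(p.grad is not None for p in qformer.parameters()))              # False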
This work proposes InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition and achieves competitive text-image composition scores compared with public solutions, including GPT-4V and GPT-3.5.