3260 papers • 126 benchmarks • 313 datasets
Image-text retrieval refers to the process of finding relevant images based on textual descriptions or retrieving textual descriptions that are relevant to a given image. It's an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information present in images and the textual descriptions that humans use to interpret them.
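As a concrete illustration of the text-to-image direction, a common dual-encoder approach embeds both modalities into a shared space and ranks candidate images by similarity to the query text. The sketch below assumes the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint; the image file names are placeholders.

```python
# Minimal text-to-image retrieval with a dual-encoder (CLIP-style) model.
# Requires: pip install torch transformers pillow
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder candidate images and a text query.
images = [Image.open(p).convert("RGB") for p in ["cat.jpg", "beach.jpg", "city.jpg"]]
query = "a photo of a cat sleeping on a sofa"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_text[0]  # (num_images,) similarity scores

best = scores.argmax().item()
print(f"Best match: image {best} (score {scores[best]:.2f})")
```

Because the two encoders never attend to each other, image embeddings can be precomputed and indexed, which is what makes this family of models practical for large-scale retrieval.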
These leaderboards are used to track progress in Image-Text Retrieval.
Use these libraries to find Image-Text Retrieval models and implementations.
No subtasks available.
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
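As a hedged usage sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint available through Hugging Face transformers and a placeholder image path, zero-shot image-to-text generation with BLIP-2 looks roughly like this:

```python
# Zero-shot image-to-text generation with BLIP-2 (illustrative sketch).
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg").convert("RGB")  # placeholder path
# An optional natural-language prompt steers the generation.
inputs = processor(images=image, text="Question: what is in the photo? Answer:",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```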
ALBEF introduces a contrastive loss to ALign the image and text representations BEfore Fusing them through cross-modal attention, which enables more grounded vision-and-language representation learning, and proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model.
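The image-text contrastive objective underlying this kind of alignment is typically a symmetric InfoNCE loss over in-batch negatives. The following is a generic PyTorch sketch of that loss, not the authors' implementation; momentum distillation is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) encoder outputs; the i-th image and
    i-th text form a positive pair, and every other in-batch pairing
    serves as a negative.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image
    return (loss_i2t + loss_t2i) / 2
```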
This paper proposes a new learning method, Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of image-text alignments.
The Image-Grounded Language Understanding Evaluation benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
A model is presented that generates natural language descriptions of images and their regions, using a novel combination of convolutional neural networks over image regions, bidirectional recurrent neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
This work introduces FLAVA, a single holistic universal model that serves as a “foundation” targeting all modalities at once, and demonstrates impressive performance on a wide range of 35 tasks spanning these modalities.
This work proposes a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework that outperforms both UNITER and OpenAI CLIP on various downstream tasks, and builds a large queue-based dictionary that can incorporate more negative samples with limited GPU resources.
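A queue-based dictionary of this kind keeps encoded features from previous batches around as additional negatives. The snippet below is an illustrative sketch of the FIFO bookkeeping only (hypothetical class name, MoCo-style design), not BriVL's actual code.

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """Fixed-size FIFO of past embeddings reused as extra negatives
    (hypothetical helper; MoCo-style bookkeeping, not BriVL's code)."""

    def __init__(self, dim, size=65536):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, emb):
        """Overwrite the oldest entries with the newest batch."""
        n = emb.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.size(0)
        self.queue[idx] = emb
        self.ptr = (self.ptr + n) % self.queue.size(0)
```

Because the queue decouples the number of negatives from the batch size, it allows far larger effective dictionaries than in-batch negatives alone would permit on the same hardware.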
A large-scale aerial image data set is constructed for remote sensing image captioning, and extensive experiments demonstrate that the content of a remote sensing image can be comprehensively described by the generated language descriptions.
Experimental results show that OPT can learn strong image-text-audio multimodal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
An effective metric named Average Semantic Precision (ASP) is presented, which measures the ranking precision of semantic correlation for retrieval sets, together with a novel and concise objective, coined Differentiable ASP Approximation (DAA), which optimizes ASP directly by making its ranking function differentiable through a sigmoid function.
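The key step in making a rank-based metric differentiable is replacing the hard rank, 1 + #{j : s_j > s_i}, with a sigmoid-smoothed count of higher-scoring items. The sketch below is a generic illustration of that idea, not the paper's exact formulation; the temperature is a hypothetical hyperparameter.

```python
import torch

def soft_rank(scores, temperature=0.01):
    """Differentiable approximation of 1-based ranks.

    The hard rank of item i is 1 + #{j : s_j > s_i}; replacing the
    indicator with a sigmoid makes the count, and therefore any
    rank-based metric built on it, differentiable in the scores.
    """
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)  # diff[i, j] = s_j - s_i
    soft_greater = torch.sigmoid(diff / temperature)
    # Subtract the diagonal's sigmoid(0) = 0.5 self-comparison.
    return 1 + soft_greater.sum(dim=1) - 0.5
```

Lower temperatures track the hard rank more closely but yield vanishing gradients, so the value trades approximation fidelity against trainability.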