3260 papers • 126 benchmarks • 313 datasets
Image-text retrieval refers to the process of finding relevant images based on textual descriptions or retrieving textual descriptions that are relevant to a given image. It's an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information present in images and the textual descriptions that humans use to interpret them.
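As a concrete illustration of the text-to-image direction, a common dual-encoder approach embeds both modalities into a shared space and ranks candidate images by similarity to the query text. The sketch below assumes the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint; the image file names are placeholders.

```python
# Minimal text-to-image retrieval with a dual-encoder (CLIP-style) model.
# Requires: pip install torch transformers pillow
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder candidate images and a text query.
images = [Image.open(p).convert("RGB") for p in ["cat.jpg", "beach.jpg", "city.jpg"]]
query = "a photo of a cat sleeping on a sofa"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_text[0]  # (num_images,) similarity scores

best = scores.argmax().item()
print(f"Best match: image {best} (score {scores[best]:.2f})")
```

Because the two encoders never attend to each other, image embeddings can be precomputed and indexed, which is what makes this family of models practical for large-scale retrieval.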
These leaderboards are used to track progress in Image-Text Retrieval.
Use these libraries to find Image-Text Retrieval models and implementations.
No subtasks available.
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
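As a hedged usage sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint available through Hugging Face transformers and a placeholder image path, zero-shot image-to-text generation with BLIP-2 looks roughly like this:

```python
# Zero-shot image-to-text generation with BLIP-2 (illustrative sketch).
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg").convert("RGB")  # placeholder path
# An optional natural-language prompt steers the generation.
inputs = processor(images=image, text="Question: what is in the photo? Answer:",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```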
ALBEF introduces a contrastive loss to ALign the image and text representations BEfore Fusing them through cross-modal attention, which enables more grounded vision-and-language representation learning, and proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model.
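The image-text contrastive objective underlying this kind of alignment is typically a symmetric InfoNCE loss over in-batch negatives. The following is a generic PyTorch sketch of that loss, not the authors' implementation; momentum distillation is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) encoder outputs; the i-th image and
    i-th text form a positive pair, and every other in-batch pairing
    serves as a negative.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image
    return (loss_i2t + loss_t2i) / 2
```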
This paper proposes a new learning method, Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of image-text alignments.
The Image-Grounded Language Understanding Evaluation benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
A model is presented that generates natural language descriptions of images and their regions, using a novel combination of convolutional neural networks over image regions, bidirectional recurrent neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
This work introduces FLAVA, a single holistic universal model that serves as a “foundation” targeting all modalities at once, and demonstrates impressive performance on a wide range of 35 tasks spanning these modalities.
This work proposes a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework that outperforms both UNITER and OpenAI CLIP on various downstream tasks, and builds a large queue-based dictionary that can incorporate more negative samples with limited GPU resources.
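A queue-based dictionary of this kind keeps encoded features from previous batches around as additional negatives. The snippet below is an illustrative sketch of the FIFO bookkeeping only (hypothetical class name, MoCo-style design), not BriVL's actual code.

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """Fixed-size FIFO of past embeddings reused as extra negatives
    (hypothetical helper; MoCo-style bookkeeping, not BriVL's code)."""

    def __init__(self, dim, size=65536):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, emb):
        """Overwrite the oldest entries with the newest batch."""
        n = emb.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.size(0)
        self.queue[idx] = emb
        self.ptr = (self.ptr + n) % self.queue.size(0)
```

Because the queue decouples the number of negatives from the batch size, it allows far larger effective dictionaries than in-batch negatives alone would permit on the same hardware.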
A large-scale aerial image data set is constructed for remote sensing image captioning, and extensive experiments demonstrate that the content of a remote sensing image can be comprehensively described by the generated language descriptions.
Experimental results show that OPT can learn strong image-text-audio multimodal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
An effective metric named Average Semantic Precision (ASP) is presented, which measures the ranking precision of semantic correlation for retrieval sets, together with a novel and concise objective, coined Differentiable ASP Approximation (DAA), which optimizes ASP directly by making its ranking function differentiable through a sigmoid function.
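The key step in making a rank-based metric differentiable is replacing the hard rank, 1 + #{j : s_j > s_i}, with a sigmoid-smoothed count of higher-scoring items. The sketch below is a generic illustration of that idea, not the paper's exact formulation; the temperature is a hypothetical hyperparameter.

```python
import torch

def soft_rank(scores, temperature=0.01):
    """Differentiable approximation of 1-based ranks.

    The hard rank of item i is 1 + #{j : s_j > s_i}; replacing the
    indicator with a sigmoid makes the count, and therefore any
    rank-based metric built on it, differentiable in the scores.
    """
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)  # diff[i, j] = s_j - s_i
    soft_greater = torch.sigmoid(diff / temperature)
    # Subtract the diagonal's sigmoid(0) = 0.5 self-comparison.
    return 1 + soft_greater.sum(dim=1) - 0.5
```

Lower temperatures track the hard rank more closely but yield vanishing gradients, so the value trades approximation fidelity against trainability.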