3260 papers • 126 benchmarks • 313 datasets
Zero-Shot Cross-Modal Retrieval is the task of finding relevant items across different modalities without having received any training examples: for example, given an image, find a relevant text, or vice versa. The main challenge in the task is known as the heterogeneity gap: because items from different modalities have different data types, the similarity between them cannot be measured directly. Most methods published to date therefore attempt to bridge this gap by learning a latent representation space in which the similarity between items from different modalities can be measured. Source: Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
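A minimal sketch of this idea in PyTorch, using placeholder linear encoders and random features (all dimensions, encoder names, and the top-5 cutoff are illustrative, not any specific published model): both modalities are projected into one shared space and ranked by cosine similarity.

```python
import torch
import torch.nn.functional as F

# Two modality-specific encoders project items into one shared d-dimensional space;
# retrieval is nearest-neighbour search by cosine similarity in that space.
d = 256
image_encoder = torch.nn.Linear(2048, d)   # stand-in for a vision backbone
text_encoder = torch.nn.Linear(768, d)     # stand-in for a text backbone

image_feats = torch.randn(100, 2048)       # pooled features for 100 gallery images
query_feats = torch.randn(1, 768)          # one text query

img_emb = F.normalize(image_encoder(image_feats), dim=-1)   # (100, d), unit norm
txt_emb = F.normalize(text_encoder(query_feats), dim=-1)    # (1, d), unit norm

# On unit-norm embeddings, cosine similarity is a dot product; rank images for the query.
scores = txt_emb @ img_emb.T                                # (1, 100)
top5 = scores.topk(5, dim=-1).indices
```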
These leaderboards are used to track progress in Zero-Shot Cross-Modal Retrieval
Use these libraries to find Zero-Shot Cross-Modal Retrieval models and implementations
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
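This caption-matching pre-training task (popularized by CLIP) can be sketched as a symmetric in-batch contrastive loss. The function name and the fixed temperature below are assumptions for illustration; the actual models typically learn the temperature.

```python
import torch
import torch.nn.functional as F

def contrastive_caption_matching_loss(img_emb, txt_emb, temperature=0.07):
    # Within a batch of (image, text) pairs, each image must pick out its own caption
    # and each caption must pick out its own image.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature                     # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # true pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                    # image -> caption direction
    loss_t2i = F.cross_entropy(logits.T, targets)                  # caption -> image direction
    return (loss_i2t + loss_t2i) / 2
```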
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
A minimal VLP model, Vision-and-Language Transformer (ViLT), is presented; it is monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed, and ViLT is shown to be up to tens of times faster than previous VLP models while achieving competitive or better downstream task performance.
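A rough sketch of that convolution-free treatment of visual inputs: image patches are flattened and linearly projected into token embeddings, mirroring how text tokens are embedded. Patch size, image size, and dimensions below are illustrative assumptions, not ViLT's exact configuration.

```python
import torch

# Flatten fixed-size patches and project them linearly; no CNN or region detector involved.
patch, d = 32, 768
to_visual_token = torch.nn.Linear(3 * patch * patch, d)

img = torch.randn(1, 3, 384, 384)
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)                    # (1, 3, 12, 12, 32, 32)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)    # (1, 144, 3072)
visual_tokens = to_visual_token(patches)   # (1, 144, d), consumed alongside text tokens downstream
```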
A contrastive loss is proposed to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning; momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model, is also proposed.
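The momentum-distillation idea can be sketched as an exponential-moving-average copy of the model producing soft pseudo-targets. The helper names and hyperparameters below are illustrative, not ALBEF's exact implementation.

```python
import torch

@torch.no_grad()
def momentum_update(online_model, momentum_model, m=0.995):
    # The momentum model's weights are an exponential moving average of the online model's.
    for p, p_m in zip(online_model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def distilled_targets(momentum_logits, alpha=0.4):
    # Soft pseudo-targets: blend the momentum model's predicted image-text matching
    # distribution with the one-hot ground-truth pairing; the online model is trained
    # against this mixture instead of the hard labels alone.
    one_hot = torch.eye(momentum_logits.size(0), device=momentum_logits.device)
    return alpha * momentum_logits.softmax(dim=-1) + (1 - alpha) * one_hot
```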
Contrastive Captioner (CoCa) is presented, a minimalist design that pretrains an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.
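A hedged sketch of such a joint objective: a CLIP-style contrastive term over pooled embeddings plus a next-token captioning term from the decoder. The function name, temperature, and equal weighting of the two terms are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def joint_contrastive_captioning_loss(img_emb, txt_emb, caption_logits, caption_targets,
                                      temperature=0.07):
    # Contrastive term: align pooled image and text embeddings within the batch.
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature
    pair_idx = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, pair_idx) + F.cross_entropy(logits.T, pair_idx)) / 2
    # Captioning term: token-level cross-entropy on the multimodal decoder's outputs.
    # caption_logits: (B, T, vocab); caption_targets: (B, T) token ids.
    captioning = F.cross_entropy(caption_logits.transpose(1, 2), caption_targets)
    return contrastive + captioning   # the two terms are weighted in practice; weights omitted here
```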
A noisy dataset of over one billion image alt-text pairs is leveraged, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
This work introduces Flamingo, a family of Visual Language Models (VLM) with the ability to bridge powerful pretrained vision-only and language-only models, handle sequences of arbitrarily interleaved visual and textual data, and seamlessly ingest images or videos as inputs.
This work investigates scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository, and finds that the training distribution plays a key role in scaling laws, as the OpenAI and OpenCLIP models exhibit different scaling behavior.
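For reference, zero-shot retrieval scores with an OpenCLIP checkpoint can be computed along these lines; the checkpoint tag, image path, and prompts are illustrative examples, not recommendations from the paper.

```python
import torch
import open_clip
from PIL import Image

# Load a pretrained model and its preprocessing; see the OpenCLIP repository for available weights.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)            # (1, 3, H, W)
texts = tokenizer(['a photo of a dog', 'a diagram of a transformer'])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(texts)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    similarity = img_emb @ txt_emb.T   # cosine similarities used to rank retrieval candidates
```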
This work introduces BEiT-3, a general-purpose multimodal foundation model that achieves state-of-the-art transfer performance on both vision and vision-language tasks, and presents Multiway Transformers for general-purpose modeling, whose modular architecture enables both deep fusion and modality-specific encoding.
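The Multiway idea can be sketched as modality-specific feed-forward experts behind shared attention. The class below is a simplified illustration with assumed names and sizes, not BEiT-3's implementation.

```python
import torch
import torch.nn as nn

class MultiwayFFN(nn.Module):
    """After shared self-attention, tokens are routed to a feed-forward 'expert'
    chosen by modality (vision, language, or fused vision-language)."""
    def __init__(self, d=768, hidden=3072):
        super().__init__()
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))
            for name in ('vision', 'language', 'vision_language')
        })

    def forward(self, x, modality):
        # x: (B, T, d) token features belonging to the given modality branch.
        return self.experts[modality](x)

# Usage: image tokens and text tokens pass through their own experts.
ffn = MultiwayFFN()
vision_out = ffn(torch.randn(2, 196, 768), 'vision')
language_out = ffn(torch.randn(2, 40, 768), 'language')
```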
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. In the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole-image positional embeddings; this better matches the use of positional embeddings at the region level in the detection finetuning phase. In addition, we replace the common softmax cross-entropy loss in contrastive learning with a focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points, in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation and achieves the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
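The cropped positional embedding idea can be sketched as follows; the grid size, crop sampling, and function name are assumptions for illustration rather than the paper's exact recipe.

```python
import random
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_emb, grid=14, min_scale=0.1):
    # Sample a random region of the full-image position grid and resize it back to the
    # full grid, so pretraining sees position patterns resembling the region crops used
    # later in detection finetuning.
    d = pos_emb.size(-1)
    pe = pos_emb.reshape(1, grid, grid, d).permute(0, 3, 1, 2)          # (1, d, grid, grid)
    h = random.randint(max(1, int(grid * min_scale)), grid)
    w = random.randint(max(1, int(grid * min_scale)), grid)
    top, left = random.randint(0, grid - h), random.randint(0, grid - w)
    crop = pe[:, :, top:top + h, left:left + w]
    crop = F.interpolate(crop, size=(grid, grid), mode='bilinear', align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(grid * grid, d)

# Usage with illustrative ViT dimensions (14x14 patch grid, 768-d embeddings).
pos_emb = torch.randn(14 * 14, 768)
pos_emb_cropped = cropped_positional_embedding(pos_emb)
```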