3260 papers • 126 benchmarks • 313 datasets
Cross-Modal Information Retrieval (CMIR) is the task of finding relevant items across different modalities. For example, given an image, retrieve a relevant text, or vice versa. The main challenge in CMIR is known as the heterogeneity gap: since items from different modalities have different data types, the similarity between them cannot be measured directly. Therefore, the majority of CMIR methods published to date attempt to bridge this gap by learning a latent representation space, where the similarity between items from different modalities can be measured. Source: Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
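As a rough illustration of the shared-latent-space idea, the sketch below (PyTorch, with hypothetical feature and embedding dimensions) projects precomputed image and text features into one embedding space and ranks items by cosine similarity; it is a minimal dual-encoder sketch, not any specific published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEncoder(nn.Module):
    """Projects image and text features into a common embedding space
    so that cross-modal similarity can be measured directly."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # image branch
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # text branch

    def forward(self, img_feats, txt_feats):
        # L2-normalize so the dot product equals cosine similarity
        img_emb = F.normalize(self.img_proj(img_feats), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img_emb, txt_emb

# Retrieval: rank texts for each image by cosine similarity
encoder = SharedSpaceEncoder()
img_emb, txt_emb = encoder(torch.randn(4, 2048), torch.randn(10, 768))
scores = img_emb @ txt_emb.T                       # (4, 10) similarity matrix
ranking = scores.argsort(dim=-1, descending=True)  # best-matching texts first
```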
(Image credit: Papersgraph)
A novel deep neural network based architecture is proposed that is designed to learn a discriminative shared feature space for all the input modalities, suitable for semantically coherent information retrieval.
This paper proposes a novel GAN-based model that can retrieve relevant images in a zero-shot setup and comfortably outperforms several state-of-the-art zero-shot text-to-image retrieval models, as well as zero-shot classification and hashing models suitably used for retrieval.
It is argued that the fine-grained alignments produced by TERAN pave the way for research on effective and efficient methods for large-scale cross-modal information retrieval.
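As a rough illustration of fine-grained alignment scoring (a late-interaction scheme in the spirit of TERAN, not its exact formulation), the sketch below scores an image-sentence pair from a region-word cosine-similarity matrix: each word is matched to its best region and the matches are averaged.

```python
import torch
import torch.nn.functional as F

def fine_grained_score(region_emb, word_emb):
    """region_emb: (n_regions, dim), word_emb: (n_words, dim).
    Returns one image-sentence relevance score built from word-to-region
    alignments (max over regions per word, mean over words)."""
    region_emb = F.normalize(region_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    sim = word_emb @ region_emb.T            # (n_words, n_regions) cosine matrix
    return sim.max(dim=1).values.mean()      # best region per word, averaged
```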
A Generalized Pooling Operator (GPO) is proposed that learns to automatically adapt itself to the best pooling strategy for different features, requires no manual tuning while staying effective and efficient, and can serve as a plug-and-play feature aggregation module for standard VSE models.
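The snippet below is a simplified, hedged sketch of the idea behind such a learned pooling operator: per-coordinate sorting followed by learnable position weights. The actual GPO generates its weights with a small sequence model; the fixed weight vector and the `max_items` value here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleLearnedPooling(nn.Module):
    """Sketch of a GPO-style pooling: sort each coordinate across the set,
    then aggregate with learnable, position-dependent weights."""
    def __init__(self, max_items=50):
        super().__init__()
        self.pos_weights = nn.Parameter(torch.zeros(max_items))

    def forward(self, feats):                 # feats: (batch, n_items, dim)
        n = feats.size(1)
        sorted_feats, _ = feats.sort(dim=1, descending=True)  # per-coordinate sort
        w = torch.softmax(self.pos_weights[:n], dim=0)        # learned position weights
        return (sorted_feats * w.view(1, n, 1)).sum(dim=1)    # (batch, dim) pooled vector
```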
An approach is presented that finds the underlying semantics of image descriptions and introduces a semantically enhanced hard-negatives loss function, where the learning objective is dynamically determined based on the optimal similarity scores between irrelevant image-description pairs.
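For context, the sketch below implements the standard in-batch hardest-negative triplet loss (in the style of VSE++) that hard-negative variants such as this one refine; it is not the semantically enhanced objective itself, and the margin value is an assumption.

```python
import torch

def hardest_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: L2-normalized (batch, dim) embeddings of matching pairs.
    Penalizes the hardest in-batch negative in both retrieval directions."""
    scores = img_emb @ txt_emb.T                    # (batch, batch) similarities
    pos = scores.diag().view(-1, 1)                 # matching-pair scores
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)

    # image -> text: margin violation w.r.t. each non-matching caption
    cost_i2t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    # text -> image: margin violation w.r.t. each non-matching image
    cost_t2i = (margin + scores - pos.T).clamp(min=0).masked_fill(mask, 0)
    return cost_i2t.max(dim=1).values.mean() + cost_t2i.max(dim=0).values.mean()
```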
This work proposes a new cross-modal small molecule retrieval task, designed to force a model to learn to associate the structure of a small molecule with the transcriptional change it induces, formally developed as a multi-view alignment problem, and presents a coordinated deep learning approach.
This paper proposes various neural network models of increasing complexity that learn to generate, from a short descriptive text, a high level visual representation in a visual feature space such as the pool5 layer of the ResNet-152 or the fc6–fc7 layers of an AlexNet trained on ILSVRC12 and Places databases.
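A minimal sketch of that idea, assuming precomputed sentence embeddings and target visual features (the dimensions and the single mapping network are illustrative, simpler than the models in the paper): a network is trained to regress visual features from text, after which retrieval reduces to nearest-neighbour search in the visual feature space.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 300-d sentence embeddings -> 2048-d visual (pool5-like) features
text_to_visual = nn.Sequential(
    nn.Linear(300, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
)
optimizer = torch.optim.Adam(text_to_visual.parameters(), lr=1e-3)

def train_step(sent_emb, visual_feat):
    """One regression step: push the predicted visual representation of a
    caption toward the visual features of its matching image."""
    pred = text_to_visual(sent_emb)
    loss = nn.functional.mse_loss(pred, visual_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```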
This paper proposes VisualSparta (Visual-text Sparse Transformer Matching), a novel model that shows significant improvement in terms of both accuracy and efficiency and is capable of outperforming previous state-of-the-art scalable methods on MSCOCO and Flickr30K.