3260 papers • 126 benchmarks • 313 datasets
Cross-Modal Information Retrieval (CMIR) is the task of finding relevant items across different modalities. For example, given an image, retrieve a relevant text, or vice versa. The main challenge in CMIR is known as the heterogeneity gap: since items from different modalities have different data types, the similarity between them cannot be measured directly. Therefore, the majority of CMIR methods published to date attempt to bridge this gap by learning a latent representation space, where the similarity between items from different modalities can be measured. Source: Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
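As a rough illustration of the shared-latent-space idea, the sketch below (PyTorch, with hypothetical feature and embedding dimensions) projects precomputed image and text features into one embedding space and ranks items by cosine similarity; it is a minimal dual-encoder sketch, not any specific published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEncoder(nn.Module):
    """Projects image and text features into a common embedding space
    so that cross-modal similarity can be measured directly."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # image branch
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # text branch

    def forward(self, img_feats, txt_feats):
        # L2-normalize so the dot product equals cosine similarity
        img_emb = F.normalize(self.img_proj(img_feats), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img_emb, txt_emb

# Retrieval: rank texts for each image by cosine similarity
encoder = SharedSpaceEncoder()
img_emb, txt_emb = encoder(torch.randn(4, 2048), torch.randn(10, 768))
scores = img_emb @ txt_emb.T                       # (4, 10) similarity matrix
ranking = scores.argsort(dim=-1, descending=True)  # best-matching texts first
```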
(Image credit: Papersgraph)
A novel deep neural network based architecture is proposed that is designed to learn a discriminative shared feature space for all the input modalities, suitable for semantically coherent information retrieval.
This paper proposes a novel GAN-based model that can retrieve relevant images in a zero-shot setup and comfortably outperforms several state-of-the-art zero-shot text-to-image retrieval models, as well as zero-shot classification and hashing models suitably used for retrieval.
It is argued that the fine-grained alignments produced by TERAN pave the way for research on effective and efficient methods for large-scale cross-modal information retrieval.
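As a rough illustration of fine-grained alignment scoring (a late-interaction scheme in the spirit of TERAN, not its exact formulation), the sketch below scores an image-sentence pair from a region-word cosine-similarity matrix: each word is matched to its best region and the matches are averaged.

```python
import torch
import torch.nn.functional as F

def fine_grained_score(region_emb, word_emb):
    """region_emb: (n_regions, dim), word_emb: (n_words, dim).
    Returns one image-sentence relevance score built from word-to-region
    alignments (max over regions per word, mean over words)."""
    region_emb = F.normalize(region_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    sim = word_emb @ region_emb.T            # (n_words, n_regions) cosine matrix
    return sim.max(dim=1).values.mean()      # best region per word, averaged
```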
A Generalized Pooling Operator (GPO) is proposed that learns to automatically adapt itself to the best pooling strategy for different features, requires no manual tuning while staying effective and efficient, and can serve as a plug-and-play feature aggregation module for standard VSE models.
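The snippet below is a simplified, hedged sketch of the idea behind such a learned pooling operator: per-coordinate sorting followed by learnable position weights. The actual GPO generates its weights with a small sequence model; the fixed weight vector and the `max_items` value here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleLearnedPooling(nn.Module):
    """Sketch of a GPO-style pooling: sort each coordinate across the set,
    then aggregate with learnable, position-dependent weights."""
    def __init__(self, max_items=50):
        super().__init__()
        self.pos_weights = nn.Parameter(torch.zeros(max_items))

    def forward(self, feats):                 # feats: (batch, n_items, dim)
        n = feats.size(1)
        sorted_feats, _ = feats.sort(dim=1, descending=True)  # per-coordinate sort
        w = torch.softmax(self.pos_weights[:n], dim=0)        # learned position weights
        return (sorted_feats * w.view(1, n, 1)).sum(dim=1)    # (batch, dim) pooled vector
```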
An approach is presented that finds the underlying semantics of image descriptions and introduces a semantically enhanced hard-negatives loss function, where the learning objective is dynamically determined based on the optimal similarity scores between irrelevant image-description pairs.
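For context, the sketch below implements the standard in-batch hardest-negative triplet loss (in the style of VSE++) that hard-negative variants such as this one refine; it is not the semantically enhanced objective itself, and the margin value is an assumption.

```python
import torch

def hardest_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: L2-normalized (batch, dim) embeddings of matching pairs.
    Penalizes the hardest in-batch negative in both retrieval directions."""
    scores = img_emb @ txt_emb.T                    # (batch, batch) similarities
    pos = scores.diag().view(-1, 1)                 # matching-pair scores
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)

    # image -> text: margin violation w.r.t. each non-matching caption
    cost_i2t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    # text -> image: margin violation w.r.t. each non-matching image
    cost_t2i = (margin + scores - pos.T).clamp(min=0).masked_fill(mask, 0)
    return cost_i2t.max(dim=1).values.mean() + cost_t2i.max(dim=0).values.mean()
```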
This work proposes a new cross-modal small molecule retrieval task, designed to force a model to learn to associate the structure of a small molecule with the transcriptional change it induces, formally developed as a multi-view alignment problem, and presents a coordinated deep learning approach.
This paper proposes various neural network models of increasing complexity that learn to generate, from a short descriptive text, a high level visual representation in a visual feature space such as the pool5 layer of the ResNet-152 or the fc6–fc7 layers of an AlexNet trained on ILSVRC12 and Places databases.
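A minimal sketch of that idea, assuming precomputed sentence embeddings and target visual features (the dimensions and the single mapping network are illustrative, simpler than the models in the paper): a network is trained to regress visual features from text, after which retrieval reduces to nearest-neighbour search in the visual feature space.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 300-d sentence embeddings -> 2048-d visual (pool5-like) features
text_to_visual = nn.Sequential(
    nn.Linear(300, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
)
optimizer = torch.optim.Adam(text_to_visual.parameters(), lr=1e-3)

def train_step(sent_emb, visual_feat):
    """One regression step: push the predicted visual representation of a
    caption toward the visual features of its matching image."""
    pred = text_to_visual(sent_emb)
    loss = nn.functional.mse_loss(pred, visual_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```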
This paper proposes VisualSparta (Visual-text Sparse Transformer Matching), a novel model that shows significant improvement in terms of both accuracy and efficiency and is capable of outperforming previous state-of-the-art scalable methods on MSCOCO and Flickr30K.