Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. Its main challenge is the modality gap, and the key solution is to learn a shared subspace into which the different modalities are mapped, so that the resulting representations can be compared with standard distance metrics such as cosine or Euclidean distance.
References:
[1] Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
[2] Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval
(Image credit: Papersgraph)
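As a minimal illustration of the shared-subspace idea above, the sketch below ranks image embeddings against a text query by cosine similarity; the random tensors stand in for the outputs of whatever image and text encoders are used, and all names are illustrative.

```python
# Minimal sketch of cross-modal retrieval in a shared embedding space.
# Random tensors stand in for image/text encoder outputs.
import torch
import torch.nn.functional as F

def retrieve(image_embeddings: torch.Tensor, text_query: torch.Tensor, k: int = 5):
    """Rank images for one text query by cosine similarity in the shared space."""
    img = F.normalize(image_embeddings, dim=-1)   # (N, d)
    txt = F.normalize(text_query, dim=-1)         # (d,)
    scores = img @ txt                            # cosine similarities, (N,)
    return scores.topk(k).indices                 # indices of the top-k images

# toy usage
images = torch.randn(100, 512)
query = torch.randn(512)
print(retrieve(images, query))
```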
These leaderboards are used to track progress in Cross-Modal Retrieval.
Use these libraries to find Cross-Modal Retrieval models and implementations.
A simple change to the common loss functions used for multi-modal embeddings is introduced; inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, it yields significant gains in retrieval performance.
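A hinge-based ranking loss that penalizes only the hardest in-batch negative, in the spirit of the summary above, can be sketched as follows; this is an illustrative PyTorch version assuming one matched image-caption pair per batch row, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def hard_negative_triplet_loss(img_emb, txt_emb, margin: float = 0.2):
    """Hinge-based ranking loss that keeps only the hardest negative per anchor,
    for both image-to-text and text-to-image directions."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    scores = img @ txt.t()                                # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                       # matched pairs on the diagonal

    cost_i2t = (margin + scores - pos).clamp(min=0)       # image anchors vs. negative captions
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)   # caption anchors vs. negative images

    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)

    # "max of hinges": only the hardest negative contributes per anchor
    return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()
```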
This work introduces an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects, and finds that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN and MNIST.
Stacked Cross Attention is proposed to discover the full latent alignments using both image regions and words in a sentence as context and to infer image-text similarity, achieving state-of-the-art results on the MS-COCO and Flickr30K datasets.
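A much-simplified text-to-region cross-attention similarity is sketched below to illustrate the idea rather than the full stacked formulation; the temperature and mean pooling are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attention_similarity(regions, words, temperature: float = 9.0):
    """Simplified text-to-region cross attention: each word attends over image
    regions, and the image-sentence score is the mean cosine similarity between
    each word and its attended region context."""
    r = F.normalize(regions, dim=-1)                     # (R, d) region features
    w = F.normalize(words, dim=-1)                       # (T, d) word features
    attn = F.softmax(temperature * (w @ r.t()), dim=-1)  # (T, R) word-to-region weights
    context = attn @ r                                   # (T, d) attended region context
    return F.cosine_similarity(w, context, dim=-1).mean()
```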
A contrastive loss is introduced to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning; momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model, is also proposed.
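A rough sketch of an image-text contrastive loss whose targets are softened by a momentum (EMA) copy of the encoders is given below; it omits the negative queue, the fusion encoder, and the other training objectives, and the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_with_momentum_distillation(img, txt, img_m, txt_m,
                                            temp: float = 0.07, alpha: float = 0.4):
    """Image-text contrastive loss with soft pseudo-targets from momentum encoders.
    `img`/`txt` are current-encoder embeddings, `img_m`/`txt_m` come from the
    momentum encoders; all inputs are assumed L2-normalized, shape (B, d)."""
    sim_i2t = img @ txt.t() / temp                       # (B, B)
    sim_t2i = txt @ img.t() / temp

    with torch.no_grad():
        # pseudo-targets: momentum similarities mixed with one-hot ground truth
        onehot = torch.eye(img.size(0), device=img.device)
        tgt_i2t = alpha * F.softmax(img_m @ txt_m.t() / temp, dim=1) + (1 - alpha) * onehot
        tgt_t2i = alpha * F.softmax(txt_m @ img_m.t() / temp, dim=1) + (1 - alpha) * onehot

    loss_i2t = -(tgt_i2t * F.log_softmax(sim_i2t, dim=1)).sum(dim=1).mean()
    loss_t2i = -(tgt_t2i * F.log_softmax(sim_t2i, dim=1)).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```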
A minimal VLP model, Vision-and-Language Transformer (ViLT), is presented; it is monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed, and ViLT is shown to be up to tens of times faster than previous VLP models while achieving competitive or better downstream task performance.
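A convolution-free patch embedding feeding a joint transformer over image and text tokens, in the spirit of the summary above, might look like the sketch below; the patch size, dimensions, vocabulary size, and layer counts are placeholders, not the model's actual configuration.

```python
import torch
import torch.nn as nn

class MinimalJointTransformer(nn.Module):
    """Embeds an image as flattened patches with a single linear projection,
    concatenates patch tokens with word tokens, and runs a shared transformer."""
    def __init__(self, patch=32, dim=768, vocab=30522):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(3 * patch * patch, dim)    # convolution-free patch embedding
        self.word_emb = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
            num_layers=2)

    def forward(self, image, token_ids):
        B, C, H, W = image.shape
        p = self.patch
        patches = image.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = torch.cat([self.proj(patches), self.word_emb(token_ids)], dim=1)
        return self.encoder(tokens)                                # joint image-text features
```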
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.
A noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset, is leveraged, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
It is argued that deterministic functions are not sufficiently powerful to capture one-to-many correspondences, and Probabilistic Cross-Modal Embedding (PCME) is proposed, where samples from the different modalities are represented as probabilistic distributions in the common embedding space.
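One way to sketch probabilistic embeddings is to treat each item as a diagonal Gaussian and estimate a soft match probability from sampled pairwise distances; the scale and shift terms below are illustrative assumptions, not learned parameters from the paper.

```python
import torch

def match_probability(mu_a, logsig_a, mu_b, logsig_b, n_samples: int = 8,
                      a: float = 1.0, b: float = 0.0):
    """Treats each embedding as a diagonal Gaussian (mean, log-sigma), draws
    samples from both, and averages a sigmoid of the negative scaled Euclidean
    distance as a soft probability that the cross-modal pair matches."""
    za = mu_a + torch.randn(n_samples, *mu_a.shape) * logsig_a.exp()   # (K, d)
    zb = mu_b + torch.randn(n_samples, *mu_b.shape) * logsig_b.exp()   # (K, d)
    d = torch.cdist(za, zb)                       # (K, K) pairwise sample distances
    return torch.sigmoid(-a * d + b).mean()       # soft match probability
```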