cross-modal-retrieval

Image-text matching

3260 papers • 126 benchmarks • 313 datasets

Benchmarks

These leaderboards are used to track progress in cross-modal-retrieval.

Dataset: CommercialAdsDataset

Libraries

Use these libraries to find cross-modal-retrieval models and implementations.

salesforce/lavis

2 papers 8,608

Datasets

No datasets available.

Subtasks

No subtasks available.

Most implemented papers

AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

Zhe Gan, Xiaodong He, Pengchuan Zhang, Xiaolei Huang, Han Zhang, Tao Xu, Qiuyuan Huang • Mon Nov 27 2017

AttnGAN, an Attentional Generative Adversarial Network, performs attention-driven, multi-stage refinement for fine-grained text-to-image generation. The paper shows for the first time that a layered attentional GAN can automatically select word-level conditions for generating different parts of the image.

1893 0

UNITER: UNiversal Image-TExt Representation Learning

Zhe Gan, Linjie Li, Licheng Yu, Yen-Chun Chen, Jingjing Liu, Yu Cheng, Ahmed El Kholy, Faisal Ahmed • Tue Sep 24 2019

Introduces UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

2505 0

VinVL: Revisiting Visual Representations in Vision-Language Models

Jianfeng Gao, Lijuan Wang, Jianwei Yang, Pengchuan Zhang, Yejin Choi, Xiujun Li, Xiaowei Hu, Lei Zhang • Mon May 31 2021

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [20], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.

1056 0

Stacked Cross Attention for Image-Text Matching

Xiaodong He, Houdong Hu, G. Hua, Kuang-Huei Lee, Xi Chen • Tue Mar 20 2018

Proposes Stacked Cross Attention, which discovers the full latent alignments using both image regions and words in a sentence as context to infer image-text similarity, and reports state-of-the-art results on the MS-COCO and Flickr30K datasets.

1309 0
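The region-to-word attention idea behind Stacked Cross Attention can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the function names, the `temperature` value, and the plain-list vector format are assumptions made here, and the similarity normalization and pooling choices of the original method are omitted. Each image region attends over the words, and the pair is scored by how well each region agrees with its attended sentence vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_similarity(regions, words, temperature=4.0):
    """Score an image-sentence pair by attending over words for each region.

    `temperature` is a hypothetical sharpening factor chosen for illustration.
    """
    region_scores = []
    for region in regions:
        # Attention weights over words, driven by region-word cosine similarity.
        weights = softmax([temperature * cosine(region, w) for w in words])
        # Attended sentence vector: attention-weighted sum of the word vectors.
        attended = [
            sum(wt * w[d] for wt, w in zip(weights, words))
            for d in range(len(region))
        ]
        # Relevance of this region to the sentence as a whole.
        region_scores.append(cosine(region, attended))
    # Average pooling over regions gives the image-sentence score.
    return sum(region_scores) / len(region_scores)

regions = [[1.0, 0.0], [0.0, 1.0]]
matching_words = [[1.0, 0.0], [0.0, 1.0]]
mismatched_words = [[-1.0, 0.0], [0.0, -1.0]]
match_score = cross_attention_similarity(regions, matching_words)
mismatch_score = cross_attention_similarity(regions, mismatched_words)
```

With these toy vectors the matching sentence scores higher than the mismatched one, which is the property a retrieval ranking relies on.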

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

S. Hoi, Akhilesh Deepak Gotmare, Junnan Li, Ramprasaath R. Selvaraju, Shafiq R. Joty, Caiming Xiong • Thu Jul 15 2021

Introduces a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. The paper also proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model.

2513 0
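A generic image-text contrastive loss of the kind used to align the two modalities before fusion can be sketched as follows. This is a minimal illustration under stated assumptions, not ALBEF's implementation: it uses only in-batch negatives and a fixed `temperature` of 0.07 (a value assumed here), and it leaves out momentum distillation and any other component of the paper. Matched pairs are assumed to share the same index in the two lists.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def image_text_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k])."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarity between every image and every text.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text: each image should pick out its own caption.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image: the same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return (i2t + t2i) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = image_text_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = image_text_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image embedding is closest to its own caption and grows when the pairing is shuffled, so minimizing it pulls matched pairs together in the shared space.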

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Chunyuan Li, Jianfeng Gao, Houdong Hu, Lijuan Wang, Pengchuan Zhang, Lei Zhang, Yejin Choi, Li Dong, Furu Wei, Xiujun Li, Xiaowei Hu, Xi Yin • Sun Apr 12 2020

This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.

2155 0

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Xizhou Zhu, Jifeng Dai, Yue Cao, Lewei Lu, Furu Wei, Weijie Su, Bin Li • Wed Aug 21 2019

Introduces a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input.

1804 0

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

S. Hoi, Junnan Li, Caiming Xiong, Dongxu Li • Thu Jan 27 2022

BLIP makes effective use of noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones. It also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.

5936 0
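The caption bootstrapping loop described for BLIP can be sketched as a small data-cleaning routine. This is an illustrative sketch only: `bootstrap_captions`, `captioner`, and `is_match` are hypothetical names introduced here, and the toy stand-ins below replace what would be trained models in practice. The sketch assumes the filter is applied to both the original web caption and the synthetic one.

```python
def bootstrap_captions(web_pairs, captioner, is_match):
    """Return cleaned (image, caption) pairs from noisy web data.

    web_pairs: iterable of (image, web_caption) tuples
    captioner: hypothetical callable, image -> synthetic caption
    is_match:  hypothetical callable, (image, caption) -> bool filter decision
    """
    cleaned = []
    for image, web_caption in web_pairs:
        # Candidate captions: the original web text and a generated one.
        candidates = [web_caption, captioner(image)]
        for caption in candidates:
            # Keep only captions the filter judges to match the image.
            if is_match(image, caption):
                cleaned.append((image, caption))
    return cleaned

# Toy stand-ins so the sketch runs without any model (assumptions, not BLIP).
def toy_captioner(image):
    return "a photo of a " + image.split(".")[0]

def toy_filter(image, caption):
    return image.split(".")[0] in caption

web_pairs = [("dog.jpg", "buy cheap shoes"), ("cat.jpg", "my cat sleeping")]
cleaned_pairs = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
```

Here the unrelated web caption for `dog.jpg` is dropped and replaced by the synthetic one, while `cat.jpg` keeps both its web caption and its synthetic caption.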

Dual Attention Networks for Multimodal Reasoning and Matching

Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim • Tue Nov 01 2016

This work proposes Dual Attention Networks (DANs), which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language, and introduces two types of DANs, one for multimodal reasoning and one for matching.

703 0

Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking

Heng Tao Shen, Tan Wang, Jingkuan Song, Xing Xu, Yang Yang, A. Hanjalic • Sun Aug 11 2019

This work proposes a novel Multi-modal Tensor Fusion Network (MTFN) to explicitly learn an accurate image-text similarity function with rank-based tensor fusion rather than seeking a common embedding space for each image-text instance.

165 0
