1. Context-Aware Multi-View Summarization Network for Image-Text Matching
2. Associating Images with Sentences Using Recurrent Canonical Correlation Analysis
3. Multi-Modality Cross Attention Network for Image and Sentence Matching
4. SMAN: Stacked Multimodal Attention Network for Cross-Modal Image–Text Retrieval
5. Efficient Document Re-Ranking for Transformers by Precomputing Term Representations
6. Transformer Reasoning Network for Image-Text Matching and Retrieval
7. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
8. Graph Structured Network for Image-Text Matching
9. IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
10. Image and Sentence Matching via Semantic Concepts and Order Learning
11. Cross-Modal Attention With Semantic Consistence for Image–Text Matching
12. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
13. Learning Fragment Self-Attention Embeddings for Image-Text Matching
14. ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching
15. UNITER: UNiversal Image-TExt Representation Learning
16. UNITER: Learning UNiversal Image-TExt Representations
17. Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching
18. Unified Vision-Language Pre-Training for Image Captioning and VQA
19. Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators
20. Learning visual features for relational CBIR
21. Visual Semantic Reasoning for Image-Text Matching
22. CycleMatch: A cycle-consistent embedding network for image-text matching
23. Adversarial Representation Learning for Text-to-Image Matching
24. VL-BERT: Pre-training of Generic Visual-Linguistic Representations
25. Attention on Attention for Image Captioning
26. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
27. Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
28. Saliency-Guided Attention Network for Image-Sentence Matching
29. Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching
30. Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment
31. Know More Say Less: Image Captioning Based on Scene Graphs
32. Auto-Encoding Scene Graphs for Image Captioning
33. Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions
34. Joint Global and Co-Attentive Representation Learning for Image-Sentence Retrieval
35. Exploring Visual Relationship for Image Captioning
36. Learning Relationship-Aware Visual Features
37. Graph R-CNN for Scene Graph Generation
38. Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation
39. Stacked Cross Attention for Image-Text Matching
40. Learning Semantic Concepts and Order for Image and Sentence Matching
41. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
42. Learning a Recurrent Residual Fusion Network for Multimodal Matching
43. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
44. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
45. Attention is All you Need
46. A simple neural network module for relational reasoning
47. Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
48. Inferring and Executing Programs for Visual Reasoning
49. Learning to Reason: End-to-End Module Networks for Visual Question Answering
50. Self-Critical Sequence Training for Image Captioning
51. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM
52. Graph-Structured Representations for Visual Question Answering
53. Linking Image and Text with 2-Way Nets
54. SPICE: Semantic Propositional Image Caption Evaluation
55. Picture it in your mind: generating high level visual representations from textual descriptions
56. Leveraging Visual Question Answering for Image-Caption Ranking
57. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
58. Order-Embeddings of Images and Language
59. Associating neural word embeddings with deep image representations using Fisher Vectors
60. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
61. Deep visual-semantic alignments for generating image descriptions
62. Microsoft COCO: Common Objects in Context
63. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
64. Efficient Estimation of Word Representations in Vector Space
65. ROUGE: A Package for Automatic Evaluation of Summaries
66. Multi-Modal Memory Enhancement Attention Network for Image-Text Matching
67. Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching
68. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
69. Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders