1. Context-Aware Multi-View Summarization Network for Image-Text Matching
2. Associating Images with Sentences Using Recurrent Canonical Correlation Analysis
3. Multi-Modality Cross Attention Network for Image and Sentence Matching
4. SMAN: Stacked Multimodal Attention Network for Cross-Modal Image–Text Retrieval
5. Efficient Document Re-Ranking for Transformers by Precomputing Term Representations
6. Transformer Reasoning Network for Image-Text Matching and Retrieval
7. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
8. Graph Structured Network for Image-Text Matching
9. IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
10. Image and Sentence Matching via Semantic Concepts and Order Learning
11. Cross-Modal Attention With Semantic Consistence for Image–Text Matching
12. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
13. Learning Fragment Self-Attention Embeddings for Image-Text Matching
14. ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching
15. UNITER: UNiversal Image-TExt Representation Learning
16. UNITER: Learning UNiversal Image-TExt Representations
17. Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching
18. Unified Vision-Language Pre-Training for Image Captioning and VQA
19. Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators
20. Learning visual features for relational CBIR
21. Visual Semantic Reasoning for Image-Text Matching
22. CycleMatch: A cycle-consistent embedding network for image-text matching
23. Adversarial Representation Learning for Text-to-Image Matching
24. VL-BERT: Pre-training of Generic Visual-Linguistic Representations
25. Attention on Attention for Image Captioning
26. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
27. Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
28. Saliency-Guided Attention Network for Image-Sentence Matching
29. Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching
30. Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment
31. Know More Say Less: Image Captioning Based on Scene Graphs
32. Auto-Encoding Scene Graphs for Image Captioning
33. Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions
34. Joint Global and Co-Attentive Representation Learning for Image-Sentence Retrieval
35. Exploring Visual Relationship for Image Captioning
36. Learning Relationship-Aware Visual Features
37. Graph R-CNN for Scene Graph Generation
38. Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation
39. Stacked Cross Attention for Image-Text Matching
40. Learning Semantic Concepts and Order for Image and Sentence Matching
41. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
42. Learning a Recurrent Residual Fusion Network for Multimodal Matching
43. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
44. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
45. Attention is All you Need
46. A simple neural network module for relational reasoning
47. Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
48. Inferring and Executing Programs for Visual Reasoning
49. Learning to Reason: End-to-End Module Networks for Visual Question Answering
50. Self-Critical Sequence Training for Image Captioning
51. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM
52. Graph-Structured Representations for Visual Question Answering
53. Linking Image and Text with 2-Way Nets
54. SPICE: Semantic Propositional Image Caption Evaluation
55. Picture it in your mind: generating high level visual representations from textual descriptions
56. Leveraging Visual Question Answering for Image-Caption Ranking
57. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
58. Order-Embeddings of Images and Language
59. Associating neural word embeddings with deep image representations using Fisher Vectors
60. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
61. Deep visual-semantic alignments for generating image descriptions
62. Microsoft COCO: Common Objects in Context
63. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
64. Efficient Estimation of Word Representations in Vector Space
65. ROUGE: A Package for Automatic Evaluation of Summaries
66. Multi-Modal Memory Enhancement Attention Network for Image-Text Matching
67. Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching
68. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
69. Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders