3260 papers • 126 benchmarks • 313 datasets
Matching a target text to a source text based on their meaning.
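A minimal sketch of the task, assuming nothing beyond NumPy: score a source-target pair by the cosine similarity of mean-pooled word vectors. The toy random vocabulary below is a placeholder for any real text encoder.

```python
# Toy semantic text matching: cosine similarity of averaged word vectors.
# The random `vocab` table stands in for a trained embedding model.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in
         "a cat sat on the mat feline rested rug".split()}

def embed(text: str) -> np.ndarray:
    """Mean-pool word vectors; unknown words are skipped."""
    vecs = [vocab[w] for w in text.lower().split() if w in vocab]
    return np.mean(vecs, axis=0)

def match_score(source: str, target: str) -> float:
    s, t = embed(source), embed(target)
    return float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))

print(match_score("a cat sat on the mat", "the feline rested on a rug"))
```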
These leaderboards are used to track progress in Text Matching.
No benchmarks available.
Use these libraries to find Text Matching models and implementations.
No subtasks available.
An Attentional Generative Adversarial Network that enables attention-driven, multi-stage refinement for fine-grained text-to-image generation, showing for the first time that a layered attentional GAN can automatically select word-level conditions for generating different parts of the image.
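As an illustration of the word-level conditioning described above, here is a hedged sketch of attention between image sub-regions and word embeddings; shapes and the plain dot-product softmax are assumptions, not the paper's exact formulation.

```python
# Word-level attention sketch: each image sub-region attends over the
# word embeddings most relevant to it, producing a per-region word context.
import numpy as np

def word_attention(regions: np.ndarray, words: np.ndarray) -> np.ndarray:
    """regions: (R, D) sub-region features; words: (T, D) word features.
    Returns a (R, D) word-context vector per sub-region."""
    scores = regions @ words.T                   # (R, T) relevance scores
    scores -= scores.max(axis=1, keepdims=True)  # numeric stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over words
    return attn @ words                          # weighted word context

rng = np.random.default_rng(1)
ctx = word_attention(rng.normal(size=(64, 128)), rng.normal(size=(12, 128)))
print(ctx.shape)  # (64, 128)
```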
This model, which mirrors the compositional hierarchies of patterns found in image recognition, can successfully identify salient matching signals such as n-gram and n-term matches, and outperforms the baselines.
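The "matching as image recognition" idea can be made concrete with a small sketch: build a word-by-word similarity matrix for two texts and scan it with a convolution so that diagonal blocks (n-gram matches) light up. Exact-match scores and the identity kernel below are illustrative simplifications.

```python
# Match matrix + convolution: diagonal kernels detect n-gram matches.
import numpy as np

def match_matrix(a: list[str], b: list[str]) -> np.ndarray:
    return np.array([[float(x == y) for y in b] for x in a])  # exact-match signal

def conv2d_valid(m: np.ndarray, k: np.ndarray) -> np.ndarray:
    H, W = m.shape
    h, w = k.shape
    return np.array([[(m[i:i+h, j:j+w] * k).sum()
                      for j in range(W - w + 1)] for i in range(H - h + 1)])

m = match_matrix("down the rabbit hole".split(), "the rabbit hole again".split())
bigram_kernel = np.eye(2)   # fires where two adjacent words match (a 2-gram)
print(conv2d_valid(m, bigram_kernel))
```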
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream vision-and-language (V+L) tasks with joint multimodal embeddings.
Stacked Cross Attention, which discovers the full latent alignments using both image regions and words in a sentence as context to infer image-text similarity, achieves state-of-the-art results on the MS-COCO and Flickr30K datasets.
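A rough sketch of this kind of cross attention, with the temperature and the mean aggregation as assumed choices rather than the published recipe: each word attends over image regions, and the image-sentence score averages the cosine similarity between each word and its attended region context.

```python
# Cross attention for image-text similarity: words attend to regions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def image_text_similarity(regions: np.ndarray, words: np.ndarray,
                          tau: float = 4.0) -> float:
    """regions: (R, D), words: (T, D) -> scalar similarity."""
    scores = tau * (words @ regions.T)              # (T, R)
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # each word attends to regions
    context = attn @ regions                        # (T, D) per-word region context
    return float(cosine(words, context).mean())     # aggregate over words

rng = np.random.default_rng(2)
print(image_text_similarity(rng.normal(size=(36, 64)), rng.normal(size=(8, 64))))
```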
This work introduces an approach for open-domain question answering (QA) that retrieves and reads a passage graph, where vertices are passages of text and edges represent relationships that are derived from an external knowledge base or co-occurrence in the same article.
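To make the passage-graph idea concrete, here is an illustrative construction where edges come from co-occurrence in the same article; the passages are made up, and a knowledge-base linker would add further edges in the same way.

```python
# Build a passage graph for retrieval: vertices are passages, edges connect
# passages that co-occur in the same article.
from collections import defaultdict

passages = {
    "p1": ("article_A", "Paris is the capital of France."),
    "p2": ("article_A", "The Seine flows through Paris."),
    "p3": ("article_B", "France borders Belgium."),
}

edges = defaultdict(set)
by_article = defaultdict(list)
for pid, (article, _text) in passages.items():
    by_article[article].append(pid)
for group in by_article.values():   # connect passages sharing an article
    for u in group:
        for v in group:
            if u != v:
                edges[u].add(v)

print(dict(edges))   # {'p1': {'p2'}, 'p2': {'p1'}}
```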
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
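A minimal sketch of one attention-based graph reasoning step of the kind the summary describes, where each node updates its embedding from attention-weighted neighbours; the chain graph and feature sizes are placeholders, not the HGR architecture.

```python
# One attention-based graph reasoning step over node embeddings.
import numpy as np

def graph_attention_step(x: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """x: (N, D) node features; adj: (N, N) 0/1 adjacency with self-loops."""
    scores = x @ x.T
    scores = np.where(adj > 0, scores, -np.inf)   # mask non-edges
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over neighbours
    return attn @ x                               # attention-weighted update

rng = np.random.default_rng(3)
adj = np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)  # chain graph
print(graph_attention_step(rng.normal(size=(4, 16)), adj).shape)
```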
A fast and strong neural approach for general-purpose text matching that keeps three key features available for inter-sequence alignment: original point-wise features, previously aligned features, and contextual features, while simplifying all remaining components.
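A sketch of keeping those three feature streams side by side before alignment; the linear "contextual encoder" and the softmax alignment below are stand-ins under assumed shapes, not the published components.

```python
# Three feature streams concatenated before inter-sequence alignment:
# point-wise embedding, previously aligned features, contextual features.
import numpy as np

rng = np.random.default_rng(4)
D = 32
W_ctx = rng.normal(size=(D, D)) * 0.1    # stand-in contextual encoder

def align(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Soft-align each row of `a` to the rows of `b`."""
    scores = a @ b.T
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ b

def block(point: np.ndarray, prev_aligned: np.ndarray,
          other: np.ndarray) -> np.ndarray:
    contextual = np.tanh(point @ W_ctx)
    fused = np.concatenate([point, prev_aligned, contextual], axis=-1)  # (T, 3D)
    return align(fused, other)

a, b = rng.normal(size=(5, D)), rng.normal(size=(7, 3 * D))
print(block(a, np.zeros_like(a), b).shape)   # (5, 96)
```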
Compared with current state-of-the-art methods, the proposed DF-GAN is simpler yet more efficient at synthesizing realistic, text-matching images, and achieves better performance on widely used datasets.
This work proposes Dual Attention Networks (DANs), which jointly leverage visual and textual attention mechanisms to capture the fine-grained interplay between vision and language, and introduces two variants of DANs for multimodal reasoning and matching, respectively.
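A toy rendering of the dual-attention idea: a shared memory vector attends over visual and textual features in parallel, and the two attended summaries update the memory for the next step. The elementwise fusion rule and two-step loop are assumptions for illustration only.

```python
# Dual attention sketch: joint visual and textual attention over steps.
import numpy as np

def attend(query: np.ndarray, feats: np.ndarray) -> np.ndarray:
    scores = feats @ query
    scores -= scores.max()
    w = np.exp(scores)
    w /= w.sum()
    return w @ feats                       # attention-weighted summary

rng = np.random.default_rng(5)
regions, words = rng.normal(size=(49, 64)), rng.normal(size=(10, 64))
memory = np.zeros(64)
for _step in range(2):                     # multi-step joint reasoning
    v = attend(memory, regions)            # visual attention
    t = attend(memory, words)              # textual attention
    memory = memory + v * t                # joint elementwise update
print(memory[:4])
```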
This work proposes a novel Multi-modal Tensor Fusion Network (MTFN) that explicitly learns an accurate image-text similarity function with rank-based tensor fusion, rather than seeking a common embedding space for each image-text instance.
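The contrast with embedding-space methods can be sketched as follows, with a low-rank bilinear form standing in for the tensor fusion and a made-up triplet for the rank loss; none of the parameter shapes are taken from the paper.

```python
# Learn a similarity *function* over fused image-text features, trained
# with a pairwise rank (triplet) loss rather than a shared embedding space.
import numpy as np

rng = np.random.default_rng(6)
D, K = 64, 8
U = rng.normal(size=(D, K)) * 0.1          # image projection
V = rng.normal(size=(D, K)) * 0.1          # text projection
w = rng.normal(size=K)                     # scoring weights

def similarity(img: np.ndarray, txt: np.ndarray) -> float:
    fused = (img @ U) * (txt @ V)          # low-rank bilinear fusion, (K,)
    return float(fused @ w)                # learned scalar similarity

img, pos, neg = (rng.normal(size=D) for _ in range(3))
margin = 0.2
loss = max(0.0, margin - similarity(img, pos) + similarity(img, neg))
print(loss)
```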