3260 papers • 126 benchmarks • 313 datasets
The problem of retrieving images from a database based on a multi-modal (image-text) query. Specifically, the query text prompts some modification to the query image, and the task is to retrieve images with the desired modifications.
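At retrieval time, the task reduces to nearest-neighbor search over database image embeddings using a query embedding composed from the image and the modification text. A minimal sketch, assuming precomputed embeddings and using a simple (unlearned) sum as a stand-in for the learned composition functions the papers below propose:

```python
import numpy as np

def compose_query(image_emb, text_emb):
    # Stand-in fusion: the normalized sum of the two embeddings.
    # Real systems learn this composition (e.g. TIRG, ComposeAE).
    q = image_emb + text_emb
    return q / np.linalg.norm(q)

def retrieve(query_emb, database_embs, k=3):
    # Rank database images by cosine similarity to the composed query.
    db = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
    scores = db @ query_emb
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
img, txt = rng.normal(size=64), rng.normal(size=64)
database = rng.normal(size=(100, 64))
top_k = retrieve(compose_query(img, txt), database)
print(top_k)  # indices of the 3 most similar database images
```

Benchmarks for this task typically score such rankings with Recall@k over the returned index list.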
These leaderboards are used to track progress in Image Retrieval with Multi-Modal Query
Use these libraries to find Image Retrieval with Multi-Modal Query models and implementations
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
This work shows how a deep learning architecture equipped with a Relation Network (RN) module can implicitly discover and learn to reason about entities and their relations.
It is shown that FiLM layers are highly effective for visual reasoning (answering image-related questions that require a multi-step, high-level process), a task that has proven difficult for standard deep learning methods that do not explicitly model reasoning.
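FiLM's core operation is a feature-wise affine transform whose scale and shift parameters are predicted from the conditioning input (here, the question). A minimal sketch, with all shapes illustrative and random values standing in for predicted parameters:

```python
import numpy as np

def film(features, gamma, beta):
    # Feature-wise Linear Modulation: scale and shift each channel of the
    # CNN feature maps with parameters predicted from the conditioning text.
    # features: (channels, height, width); gamma, beta: (channels,)
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 4, 4))   # CNN feature maps
gamma = rng.normal(size=8)           # would be predicted from the question
beta = rng.normal(size=8)
modulated = film(feats, gamma, beta)
print(modulated.shape)  # (8, 4, 4)
```

Because the modulation is per-channel, the conditioning input can amplify or suppress entire feature maps, which is what supports the multi-step reasoning behavior described above.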
This paper proposes a new way to combine image and text through a residual connection that outperforms existing approaches on three different datasets: Fashion-200k, MIT-States, and a new synthetic dataset the authors create based on CLEVR.
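The combination described above (TIRG) gates the image features and adds a text-conditioned residual. A rough sketch of that idea, with randomly initialized weights standing in for the learned gating and residual branches:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32
# Hypothetical learned weights for the gating and residual branches.
W_gate = rng.normal(size=(d, 2 * d)) * 0.1
W_res = rng.normal(size=(d, 2 * d)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tirg_compose(image_feat, text_feat):
    # Concatenate image and text features, then combine:
    #   gated image features + text-conditioned residual.
    joint = np.concatenate([image_feat, text_feat])
    gate = sigmoid(W_gate @ joint)
    residual = W_res @ joint
    return gate * image_feat + residual

img, txt = rng.normal(size=d), rng.normal(size=d)
composed = tirg_compose(img, txt)
print(composed.shape)  # (32,)
```

The residual structure keeps the composed representation close to the original image embedding when the text asks for only a small modification.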
The proposed network, a joint network combining the CNN for ImageQA and the parameter prediction network, is trained end-to-end through back-propagation, with its weights initialized using a pre-trained CNN and GRU.
This paper proposes an automatic spatially-aware concept discovery approach using weakly labeled image-text data from shopping websites, and decomposes the visual-semantic embedding space into multiple concept-specific subspaces to facilitate structured browsing and attribute-feedback product retrieval.
This paper proposes an autoencoder based model, ComposeAE, to learn the composition of image and text query for retrieving images, which is able to outperform the state-of-the-art method TIRG on three benchmark datasets, namely: MIT-States, Fashion200k and Fashion IQ.
A unified learning approach that simultaneously models coarse- and fine-grained retrieval by considering multi-grained uncertainty is introduced; it prevents the model from pushing away potential candidates in the early stage and thus improves the recall rate.