3260 papers • 126 benchmarks • 313 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges: identifying the focus of the query, understanding the content of the image, and localizing the target object.
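At inference time, many VG systems reduce the localization step to scoring candidate regions against an embedding of the query. The following is a minimal NumPy sketch of that final step; the embeddings and boxes are toy placeholders standing in for real visual- and text-encoder outputs, and `ground_query` is a hypothetical helper, not an API from any specific paper.

```python
import numpy as np

def ground_query(query_emb, region_embs, boxes):
    """Score each candidate region against the query embedding
    (cosine similarity) and return the best-matching box."""
    q = query_emb / np.linalg.norm(query_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    scores = r @ q                      # one similarity score per region
    best = int(np.argmax(scores))
    return boxes[best], scores

# Toy example: 3 candidate regions with 4-d embeddings (placeholders
# for real encoder features) and their bounding boxes.
rng = np.random.default_rng(0)
regions = rng.normal(size=(3, 4))
boxes = [(0, 0, 50, 50), (10, 10, 80, 80), (30, 30, 60, 60)]
query = regions[1].copy()  # a query embedding aligned with region 1
box, scores = ground_query(query, regions, boxes)
```

In a real system the query embedding would come from a language model and the region embeddings from a detector backbone; the cosine-scoring step itself stays this simple.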
These leaderboards are used to track progress in Visual Grounding
Use these libraries to find Visual Grounding models and implementations
ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
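The core of the co-attentional layer is an attention exchange in which each stream's queries attend to the other stream's keys and values. A minimal single-head NumPy sketch of that exchange (omitting the learned projections, residual connections, and layer normalization of the full transformer block):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values

def co_attention(visual, text):
    """One co-attentional exchange in the spirit of ViLBERT's two-stream
    design: each stream's queries attend to the OTHER stream's keys/values."""
    visual_out = attend(visual, text, text)    # vision conditioned on language
    text_out = attend(text, visual, visual)    # language conditioned on vision
    return visual_out, text_out

# Toy features: 5 image regions and 3 text tokens, 8-d each.
rng = np.random.default_rng(0)
v, t = rng.normal(size=(5, 8)), rng.normal(size=(3, 8))
v2, t2 = co_attention(v, t)
```

Each output sequence keeps its own length but is re-expressed as a mixture of the other modality's features, which is what lets the two streams interact while remaining separate.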
OFA is proposed, a task-agnostic and modality-agnostic framework that supports task comprehensiveness and achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on uni-modal tasks.
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
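MCB approximates the outer product of two feature vectors without materializing it: each vector is count-sketched into a compact dimension, and the circular convolution of the two sketches (computed as an elementwise product in the frequency domain) approximates bilinear pooling. A minimal NumPy sketch of the idea; dimensions and seeds are illustrative, not the paper's settings.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x to d dims: sketch[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # unbuffered scatter-add into random buckets
    return y

def mcb(v, q, d=64, seed=0):
    """Multimodal Compact Bilinear pooling: count-sketch both vectors,
    then multiply their FFTs (circular convolution of the sketches
    approximates the outer product of v and q)."""
    rng = np.random.default_rng(seed)
    ffts = []
    for x in (v, q):
        h = rng.integers(0, d, size=x.shape[0])       # random bucket per dim
        s = rng.choice([-1.0, 1.0], size=x.shape[0])  # random sign per dim
        ffts.append(np.fft.rfft(count_sketch(x, h, s, d)))
    return np.fft.irfft(ffts[0] * ffts[1], n=d)

# Fuse a 100-d visual feature with a 50-d question feature into 64 dims.
rng = np.random.default_rng(1)
fused = mcb(rng.normal(size=100), rng.normal(size=50))
```

The compact output (64 dims here, thousands in practice) replaces the full 100×50 bilinear interaction, which is what makes bilinear fusion tractable for VQA and grounding models.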
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 achieves new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video captioning tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
A novel approach is presented that learns grounding by reconstructing a given phrase through an attention mechanism, which can be either latent or optimized directly; its effectiveness is demonstrated on the Flickr30k Entities and ReferItGame datasets.
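The appeal of this formulation is that grounding supervision comes for free: attention over region proposals selects the visual content used to reconstruct the phrase, so minimizing reconstruction error trains the attention, and the highest-weighted region becomes the grounded prediction. A minimal NumPy sketch of that training signal, with toy dimensions and randomly initialized weight matrices standing in for learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reconstruction_step(regions, phrase, W_attn, W_dec):
    """Grounding by reconstruction: latent attention over region proposals
    picks the visual content used to reconstruct the phrase embedding;
    the highest-weighted region is the grounded prediction."""
    weights = softmax(regions @ W_attn @ phrase)  # latent attention over regions
    attended = weights @ regions                  # attention-pooled visual feature
    recon = W_dec @ attended                      # map back to phrase space
    loss = np.mean((recon - phrase) ** 2)         # reconstruction training signal
    return loss, int(np.argmax(weights))

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 16))       # 4 region proposals, 16-d features
phrase = rng.normal(size=8)              # 8-d phrase embedding
W_attn = rng.normal(size=(16, 8)) * 0.1  # toy attention parameters
W_dec = rng.normal(size=(8, 16)) * 0.1   # toy decoder parameters
loss, grounded = reconstruction_step(regions, phrase, W_attn, W_dec)
```

In training, gradients of the reconstruction loss would update `W_attn` and `W_dec`, sharpening the attention toward the region that best explains the phrase; no bounding-box labels are needed.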
This work validates the Cross-Prompt Attack (CroPA), confirms its superior cross-prompt transferability compared to existing baselines, and provides a more robust framework for generating transferable adversarial examples, with significant implications for understanding the security of VLMs in real-world applications.
A grounded dialogue state encoder is proposed which addresses a foundational issue of how to integrate visual grounding with dialogue system components, and it is shown that the introduction of both the joint architecture and cooperative learning leads to accuracy improvements over the baseline system.
It is shown that powerful word segmentation and clustering capability emerges within the model's self-attention heads, suggesting that the visual grounding task is a crucial component of the word discovery capability the authors observe.
A novel approach where the two processes for activity classification and entity estimation are interactive and complementary, which achieves state-of-the-art results on all evaluation metrics on the SWiG dataset.