3260 papers • 126 benchmarks • 313 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges: identifying the focus of the query, understanding the content of the image, and localizing the target object.
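At inference time, many VG systems reduce the localization step to scoring candidate regions against an embedding of the query. The following is a minimal NumPy sketch of that final step; the embeddings and boxes are toy placeholders standing in for real visual- and text-encoder outputs, and `ground_query` is a hypothetical helper, not an API from any specific paper.

```python
import numpy as np

def ground_query(query_emb, region_embs, boxes):
    """Score each candidate region against the query embedding
    (cosine similarity) and return the best-matching box."""
    q = query_emb / np.linalg.norm(query_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    scores = r @ q                      # one similarity score per region
    best = int(np.argmax(scores))
    return boxes[best], scores

# Toy example: 3 candidate regions with 4-d embeddings (placeholders
# for real encoder features) and their bounding boxes.
rng = np.random.default_rng(0)
regions = rng.normal(size=(3, 4))
boxes = [(0, 0, 50, 50), (10, 10, 80, 80), (30, 30, 60, 60)]
query = regions[1].copy()  # a query embedding aligned with region 1
box, scores = ground_query(query, regions, boxes)
```

In a real system the query embedding would come from a language model and the region embeddings from a detector backbone; the cosine-scoring step itself stays this simple.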
These leaderboards are used to track progress in Visual Grounding
Use these libraries to find Visual Grounding models and implementations
ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
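The core of the co-attentional layer is an attention exchange in which each stream's queries attend to the other stream's keys and values. A minimal single-head NumPy sketch of that exchange (omitting the learned projections, residual connections, and layer normalization of the full transformer block):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values

def co_attention(visual, text):
    """One co-attentional exchange in the spirit of ViLBERT's two-stream
    design: each stream's queries attend to the OTHER stream's keys/values."""
    visual_out = attend(visual, text, text)    # vision conditioned on language
    text_out = attend(text, visual, visual)    # language conditioned on vision
    return visual_out, text_out

# Toy features: 5 image regions and 3 text tokens, 8-d each.
rng = np.random.default_rng(0)
v, t = rng.normal(size=(5, 8)), rng.normal(size=(3, 8))
v2, t2 = co_attention(v, t)
```

Each output sequence keeps its own length but is re-expressed as a mixture of the other modality's features, which is what lets the two streams interact while remaining separate.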
OFA is proposed, a task-agnostic and modality-agnostic framework that supports task comprehensiveness and achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on uni-modal tasks.
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
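MCB approximates the outer product of two feature vectors without materializing it: each vector is count-sketched into a compact dimension, and the circular convolution of the two sketches (computed as an elementwise product in the frequency domain) approximates bilinear pooling. A minimal NumPy sketch of the idea; dimensions and seeds are illustrative, not the paper's settings.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x to d dims: sketch[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # unbuffered scatter-add into random buckets
    return y

def mcb(v, q, d=64, seed=0):
    """Multimodal Compact Bilinear pooling: count-sketch both vectors,
    then multiply their FFTs (circular convolution of the sketches
    approximates the outer product of v and q)."""
    rng = np.random.default_rng(seed)
    ffts = []
    for x in (v, q):
        h = rng.integers(0, d, size=x.shape[0])       # random bucket per dim
        s = rng.choice([-1.0, 1.0], size=x.shape[0])  # random sign per dim
        ffts.append(np.fft.rfft(count_sketch(x, h, s, d)))
    return np.fft.irfft(ffts[0] * ffts[1], n=d)

# Fuse a 100-d visual feature with a 50-d question feature into 64 dims.
rng = np.random.default_rng(1)
fused = mcb(rng.normal(size=100), rng.normal(size=50))
```

The compact output (64 dims here, thousands in practice) replaces the full 100×50 bilinear interaction, which is what makes bilinear fusion tractable for VQA and grounding models.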
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 achieves new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video captioning tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
A novel approach is presented that learns grounding by reconstructing a given phrase through an attention mechanism, which can be either latent or optimized directly; its effectiveness is demonstrated on the Flickr30k Entities and ReferItGame datasets.
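The appeal of this formulation is that grounding supervision comes for free: attention over region proposals selects the visual content used to reconstruct the phrase, so minimizing reconstruction error trains the attention, and the highest-weighted region becomes the grounded prediction. A minimal NumPy sketch of that training signal, with toy dimensions and randomly initialized weight matrices standing in for learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reconstruction_step(regions, phrase, W_attn, W_dec):
    """Grounding by reconstruction: latent attention over region proposals
    picks the visual content used to reconstruct the phrase embedding;
    the highest-weighted region is the grounded prediction."""
    weights = softmax(regions @ W_attn @ phrase)  # latent attention over regions
    attended = weights @ regions                  # attention-pooled visual feature
    recon = W_dec @ attended                      # map back to phrase space
    loss = np.mean((recon - phrase) ** 2)         # reconstruction training signal
    return loss, int(np.argmax(weights))

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 16))       # 4 region proposals, 16-d features
phrase = rng.normal(size=8)              # 8-d phrase embedding
W_attn = rng.normal(size=(16, 8)) * 0.1  # toy attention parameters
W_dec = rng.normal(size=(8, 16)) * 0.1   # toy decoder parameters
loss, grounded = reconstruction_step(regions, phrase, W_attn, W_dec)
```

In training, gradients of the reconstruction loss would update `W_attn` and `W_dec`, sharpening the attention toward the region that best explains the phrase; no bounding-box labels are needed.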
This work validates the Cross-Prompt Attack (CroPA), confirms its superior cross-prompt transferability compared to existing baselines, and provides a more robust framework for generating transferable adversarial examples, with significant implications for understanding the security of VLMs in real-world applications.
A grounded dialogue state encoder is proposed which addresses a foundational issue of how to integrate visual grounding with dialogue system components, and it is shown that the introduction of both the joint architecture and cooperative learning leads to accuracy improvements over the baseline system.
It is shown that powerful word segmentation and clustering capability emerges within the model's self-attention heads, suggesting that the visual grounding task is a crucial component of the word discovery capability the authors observe.
A novel approach where the two processes for activity classification and entity estimation are interactive and complementary, which achieves state-of-the-art results on all evaluation metrics on the SWiG dataset.