3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in Natural Language Visual Grounding.
Use these libraries to find Natural Language Visual Grounding models and implementations.
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
A novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly; its effectiveness is demonstrated on the Flickr30k Entities and ReferItGame datasets.
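A minimal sketch of the grounding-by-reconstruction idea, assuming precomputed region features and a bag-of-words phrase encoder (all names, dimensions, and the training signal below are illustrative, not the paper's implementation): attend over image regions with the encoded phrase, then reconstruct the phrase from the attended visual feature, so that the attention weights act as the latent grounding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingByReconstruction(nn.Module):
    """Hypothetical sketch: ground a phrase by attending over region
    features and reconstructing the phrase from the attended feature."""
    def __init__(self, vocab_size, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.phrase_enc = nn.EmbeddingBag(vocab_size, hidden_dim)   # bag-of-words phrase encoder
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.reconstruct = nn.Linear(hidden_dim, vocab_size)        # predicts the phrase words back

    def forward(self, phrase_ids, region_feats):
        # phrase_ids: (B, T) word indices, region_feats: (B, R, region_dim)
        q = self.phrase_enc(phrase_ids)                             # (B, H)
        v = torch.tanh(self.region_proj(region_feats))              # (B, R, H)
        scores = self.att_score(torch.tanh(v + q.unsqueeze(1)))     # (B, R, 1)
        alpha = F.softmax(scores.squeeze(-1), dim=-1)               # latent grounding over regions
        attended = (alpha.unsqueeze(-1) * v).sum(dim=1)             # (B, H)
        word_logits = self.reconstruct(attended)                    # (B, vocab_size)
        return alpha, word_logits

# Training signal (illustrative): reconstruct the phrase's words from the attended
# region, e.g. multi-label BCE over the words present in the phrase; when box
# annotations are available, alpha can instead be supervised directly.
```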
A self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the navigation progress.
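A rough sketch of these two components under assumed inputs (per-word instruction features and a set of candidate view features); the module and variable names here are placeholders, not the authors' code: textual attention picks out the currently relevant instruction words, visual attention picks the next direction, and a progress monitor regresses how far along the instruction the agent is.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfMonitoringStep(nn.Module):
    """Illustrative co-grounding + progress-monitor step (not the authors' implementation)."""
    def __init__(self, txt_dim=512, img_dim=2048, hid_dim=512):
        super().__init__()
        self.txt_att = nn.Linear(hid_dim, txt_dim)              # query over instruction words
        self.img_att = nn.Linear(hid_dim + txt_dim, img_dim)    # query over candidate views
        self.policy = nn.LSTMCell(txt_dim + img_dim, hid_dim)
        self.progress = nn.Linear(hid_dim + txt_dim, 1)         # regresses completed fraction in [0, 1]

    def forward(self, word_feats, view_feats, h, c):
        # word_feats: (B, T, txt_dim), view_feats: (B, K, img_dim), h/c: (B, hid_dim)
        txt_w = F.softmax(torch.bmm(word_feats, self.txt_att(h).unsqueeze(2)).squeeze(2), dim=1)
        grounded_txt = (txt_w.unsqueeze(2) * word_feats).sum(1)      # which words matter now
        img_q = self.img_att(torch.cat([h, grounded_txt], dim=1))
        img_w = F.softmax(torch.bmm(view_feats, img_q.unsqueeze(2)).squeeze(2), dim=1)
        grounded_img = (img_w.unsqueeze(2) * view_feats).sum(1)      # where to move next
        h, c = self.policy(torch.cat([grounded_txt, grounded_img], dim=1), (h, c))
        progress = torch.sigmoid(self.progress(torch.cat([h, grounded_txt], dim=1)))
        return img_w, progress, h, c   # img_w doubles as the action distribution over views
```

In this sketch the progress output would be trained against the normalized distance-to-goal, so that the action distribution and the progress estimate stay consistent with each other.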
This work presents a robot system that follows unconstrained language instructions to pick and place arbitrary objects and effectively resolves ambiguities through dialogue, and demonstrates the method's effectiveness in understanding pick-and-place language instructions and sequentially composing them to solve tabletop manipulation tasks.
Describing what has changed in a scene can be useful to a user, but only if generated text focuses on what is semantically relevant. It is thus important to distinguish distractors (e.g. a viewpoint change) from relevant changes (e.g. an object has moved). We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning. Our model learns to distinguish distractors from semantic changes, localize the changes via Dual Attention over “before” and “after” images, and accurately describe them in natural language via Dynamic Speaker, by adaptively focusing on the necessary visual inputs (e.g. “before” or “after” image). To study the problem in depth, we collect a CLEVR-Change dataset, built off the CLEVR engine, with 5 types of scene changes. We benchmark a number of baselines on our dataset, and systematically study different change types and robustness to distractors. We show the superiority of our DUDA model in terms of both change captioning and localization. We also show that our approach is general, obtaining state-of-the-art results on the recent realistic Spot-the-Diff dataset which has no distractors.
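A hedged sketch of the dual-attention idea only (the full DUDA model also includes the Dynamic Speaker); shapes and names are illustrative: attend separately over the "before" and "after" feature maps conditioned on their joint representation, and use the difference of the two attended features as the change signal that helps separate semantic changes from distractors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Toy dual attention over "before"/"after" feature maps (illustrative only)."""
    def __init__(self, feat_dim=1024, hid_dim=512):
        super().__init__()
        self.proj = nn.Conv2d(2 * feat_dim, hid_dim, kernel_size=1)
        self.att_before = nn.Conv2d(hid_dim, 1, kernel_size=1)
        self.att_after = nn.Conv2d(hid_dim, 1, kernel_size=1)

    def attend(self, feats, att_map):
        # feats: (B, C, H, W), att_map: (B, 1, H, W) -> pooled feature (B, C)
        w = F.softmax(att_map.flatten(2), dim=-1).view_as(att_map)
        return (feats * w).sum(dim=(2, 3))

    def forward(self, feat_before, feat_after):
        joint = torch.relu(self.proj(torch.cat([feat_before, feat_after], dim=1)))
        l_before = self.attend(feat_before, self.att_before(joint))
        l_after = self.attend(feat_after, self.att_after(joint))
        l_diff = l_after - l_before        # change signal fed to a captioning decoder
        return l_before, l_after, l_diff
```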
This work proposes a visual grounding system that is end-to-end trainable in a weakly supervised fashion with only image-level annotations and counterfactually resilient owing to its modular design; it decomposes textual descriptions into three levels (entity, semantic attribute, and color information) and performs compositional grounding progressively.
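A minimal illustration of the modular, compositional scoring idea under assumed inputs (already-parsed phrase parts and candidate region features); everything below is a placeholder sketch, not the paper's implementation: score each candidate region separately against the entity, attribute, and color parts of the description, then combine the module scores into one grounding distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularGrounder(nn.Module):
    """Illustrative compositional grounding: entity, attribute, and color modules
    each score candidate regions; the combined score picks the grounded box."""
    def __init__(self, text_dim=300, region_dim=2048, hid_dim=256):
        super().__init__()
        def make_module():
            return nn.ModuleDict({
                "txt": nn.Linear(text_dim, hid_dim),
                "img": nn.Linear(region_dim, hid_dim),
            })
        self.modules_by_level = nn.ModuleDict({
            "entity": make_module(), "attribute": make_module(), "color": make_module()
        })

    def score(self, level, text_emb, region_feats):
        m = self.modules_by_level[level]
        t = F.normalize(m["txt"](text_emb), dim=-1)          # (B, H)
        r = F.normalize(m["img"](region_feats), dim=-1)      # (B, R, H)
        return torch.bmm(r, t.unsqueeze(2)).squeeze(2)       # cosine-style scores (B, R)

    def forward(self, parts, region_feats):
        # parts: dict mapping "entity"/"attribute"/"color" -> (B, text_dim) phrase-part embedding
        total = sum(self.score(level, emb, region_feats) for level, emb in parts.items())
        return F.softmax(total, dim=-1)                       # distribution over candidate regions
```

Because each level is scored by its own small module, an unused or counterfactual part of the description (e.g. a color that matches no region) degrades only its own score rather than the whole prediction, which is the intuition behind the modular design.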
This paper introduces a new dataset for video object search with referring expressions that includes numerous copies of the objects, making it difficult to use non-relational expressions, and proposes a deep attention network that significantly outperforms the baselines on this dataset.
A language-guided graph representation is proposed to capture the global context of grounding entities and their relations, and a cross-modal graph matching strategy for the multiple-phrase visual grounding task is developed.
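A small sketch of the cross-modal graph-matching idea with simplified, assumed inputs (phrase node features on the language side, region node features on the visual side, and pairwise "edges" built by concatenating endpoint features); names and the scoring scheme are illustrative: compute node-to-node and edge-to-edge similarities and combine them into per-phrase matching scores for multiple-phrase grounding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphMatcher(nn.Module):
    """Illustrative cross-modal graph matching: phrases and their relations on one
    side, regions and their pairwise relations on the other (not the paper's code)."""
    def __init__(self, txt_dim=300, img_dim=2048, hid_dim=256):
        super().__init__()
        self.txt_node = nn.Linear(txt_dim, hid_dim)
        self.img_node = nn.Linear(img_dim, hid_dim)
        self.txt_edge = nn.Linear(2 * txt_dim, hid_dim)
        self.img_edge = nn.Linear(2 * img_dim, hid_dim)

    def forward(self, phrase_feats, region_feats):
        # phrase_feats: (P, txt_dim), region_feats: (R, img_dim) for one image-sentence pair
        tn = F.normalize(self.txt_node(phrase_feats), dim=-1)          # (P, H)
        vn = F.normalize(self.img_node(region_feats), dim=-1)          # (R, H)
        node_sim = tn @ vn.t()                                         # (P, R) node-level match

        # Pairwise "edges": concatenate the two endpoint features of every pair.
        P, R = phrase_feats.size(0), region_feats.size(0)
        te = self.txt_edge(torch.cat([phrase_feats.unsqueeze(1).expand(P, P, -1),
                                      phrase_feats.unsqueeze(0).expand(P, P, -1)], dim=-1))
        ve = self.img_edge(torch.cat([region_feats.unsqueeze(1).expand(R, R, -1),
                                      region_feats.unsqueeze(0).expand(R, R, -1)], dim=-1))
        edge_sim = torch.einsum("pqh,rsh->pqrs",
                                F.normalize(te, dim=-1), F.normalize(ve, dim=-1))

        # A phrase-region pair scores well if its nodes match and, on average, its
        # relations to the other phrases match the relations among regions.
        relational = edge_sim.mean(dim=(1, 3))                          # (P, R)
        return F.softmax(node_sim + relational, dim=-1)                 # per-phrase region distribution
```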
This work focuses on the OneCommon Corpus, a simple yet challenging common grounding dataset which contains minimal bias by design, and provides comprehensive and reliable annotation for 600 dialogues, showing that the annotation captures important linguistic structures including predicate-argument structure, modification, and ellipsis.
ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld and then execute goals from the ALFRED benchmark in a rich visual environment, enables the creation of a new BUTLER agent whose abstract knowledge corresponds directly to concrete, visually grounded actions.