3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in Natural Language Visual Grounding.
Use these libraries to find Natural Language Visual Grounding models and implementations.
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
A novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly; its effectiveness is demonstrated on the Flickr30k Entities and ReferItGame datasets.
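A minimal sketch of the grounding-by-reconstruction idea, assuming precomputed region features and a bag-of-words phrase encoder (all names, dimensions, and the training signal below are illustrative, not the paper's implementation): attend over image regions with the encoded phrase, then reconstruct the phrase from the attended visual feature, so that the attention weights act as the latent grounding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingByReconstruction(nn.Module):
    """Hypothetical sketch: ground a phrase by attending over region
    features and reconstructing the phrase from the attended feature."""
    def __init__(self, vocab_size, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.phrase_enc = nn.EmbeddingBag(vocab_size, hidden_dim)   # bag-of-words phrase encoder
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.reconstruct = nn.Linear(hidden_dim, vocab_size)        # predicts the phrase words back

    def forward(self, phrase_ids, region_feats):
        # phrase_ids: (B, T) word indices, region_feats: (B, R, region_dim)
        q = self.phrase_enc(phrase_ids)                             # (B, H)
        v = torch.tanh(self.region_proj(region_feats))              # (B, R, H)
        scores = self.att_score(torch.tanh(v + q.unsqueeze(1)))     # (B, R, 1)
        alpha = F.softmax(scores.squeeze(-1), dim=-1)               # latent grounding over regions
        attended = (alpha.unsqueeze(-1) * v).sum(dim=1)             # (B, H)
        word_logits = self.reconstruct(attended)                    # (B, vocab_size)
        return alpha, word_logits

# Training signal (illustrative): reconstruct the phrase's words from the attended
# region, e.g. multi-label BCE over the words present in the phrase; when box
# annotations are available, alpha can instead be supervised directly.
```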
A self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the navigation progress.
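A rough sketch of these two components under assumed inputs (per-word instruction features and a set of candidate view features); the module and variable names here are placeholders, not the authors' code: textual attention picks out the currently relevant instruction words, visual attention picks the next direction, and a progress monitor regresses how far along the instruction the agent is.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfMonitoringStep(nn.Module):
    """Illustrative co-grounding + progress-monitor step (not the authors' implementation)."""
    def __init__(self, txt_dim=512, img_dim=2048, hid_dim=512):
        super().__init__()
        self.txt_att = nn.Linear(hid_dim, txt_dim)              # query over instruction words
        self.img_att = nn.Linear(hid_dim + txt_dim, img_dim)    # query over candidate views
        self.policy = nn.LSTMCell(txt_dim + img_dim, hid_dim)
        self.progress = nn.Linear(hid_dim + txt_dim, 1)         # regresses completed fraction in [0, 1]

    def forward(self, word_feats, view_feats, h, c):
        # word_feats: (B, T, txt_dim), view_feats: (B, K, img_dim), h/c: (B, hid_dim)
        txt_w = F.softmax(torch.bmm(word_feats, self.txt_att(h).unsqueeze(2)).squeeze(2), dim=1)
        grounded_txt = (txt_w.unsqueeze(2) * word_feats).sum(1)      # which words matter now
        img_q = self.img_att(torch.cat([h, grounded_txt], dim=1))
        img_w = F.softmax(torch.bmm(view_feats, img_q.unsqueeze(2)).squeeze(2), dim=1)
        grounded_img = (img_w.unsqueeze(2) * view_feats).sum(1)      # where to move next
        h, c = self.policy(torch.cat([grounded_txt, grounded_img], dim=1), (h, c))
        progress = torch.sigmoid(self.progress(torch.cat([h, grounded_txt], dim=1)))
        return img_w, progress, h, c   # img_w doubles as the action distribution over views
```

In this sketch the progress output would be trained against the normalized distance-to-goal, so that the action distribution and the progress estimate stay consistent with each other.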
This work presents a robot system that follows unconstrained language instructions to pick and place arbitrary objects and effectively resolves ambiguities through dialogue, and demonstrates the method's effectiveness in understanding pick-and-place language instructions and sequentially composing them to solve tabletop manipulation tasks.
Describing what has changed in a scene can be useful to a user, but only if generated text focuses on what is semantically relevant. It is thus important to distinguish distractors (e.g. a viewpoint change) from relevant changes (e.g. an object has moved). We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning. Our model learns to distinguish distractors from semantic changes, localize the changes via Dual Attention over “before” and “after” images, and accurately describe them in natural language via Dynamic Speaker, by adaptively focusing on the necessary visual inputs (e.g. “before” or “after” image). To study the problem in depth, we collect a CLEVR-Change dataset, built off the CLEVR engine, with 5 types of scene changes. We benchmark a number of baselines on our dataset, and systematically study different change types and robustness to distractors. We show the superiority of our DUDA model in terms of both change captioning and localization. We also show that our approach is general, obtaining state-of-the-art results on the recent realistic Spot-the-Diff dataset which has no distractors.
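A hedged sketch of the dual-attention idea only (the full DUDA model also includes the Dynamic Speaker); shapes and names are illustrative: attend separately over the "before" and "after" feature maps conditioned on their joint representation, and use the difference of the two attended features as the change signal that helps separate semantic changes from distractors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Toy dual attention over "before"/"after" feature maps (illustrative only)."""
    def __init__(self, feat_dim=1024, hid_dim=512):
        super().__init__()
        self.proj = nn.Conv2d(2 * feat_dim, hid_dim, kernel_size=1)
        self.att_before = nn.Conv2d(hid_dim, 1, kernel_size=1)
        self.att_after = nn.Conv2d(hid_dim, 1, kernel_size=1)

    def attend(self, feats, att_map):
        # feats: (B, C, H, W), att_map: (B, 1, H, W) -> pooled feature (B, C)
        w = F.softmax(att_map.flatten(2), dim=-1).view_as(att_map)
        return (feats * w).sum(dim=(2, 3))

    def forward(self, feat_before, feat_after):
        joint = torch.relu(self.proj(torch.cat([feat_before, feat_after], dim=1)))
        l_before = self.attend(feat_before, self.att_before(joint))
        l_after = self.attend(feat_after, self.att_after(joint))
        l_diff = l_after - l_before        # change signal fed to a captioning decoder
        return l_before, l_after, l_diff
```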
This work proposes a visual grounding system that is end-to-end trainable in a weakly supervised fashion with only image-level annotations and counterfactually resilient owing to its modular design; it decomposes textual descriptions into three levels (entity, semantic attribute, and color information) and performs compositional grounding progressively.
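A minimal illustration of the modular, compositional scoring idea under assumed inputs (already-parsed phrase parts and candidate region features); everything below is a placeholder sketch, not the paper's implementation: score each candidate region separately against the entity, attribute, and color parts of the description, then combine the module scores into one grounding distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularGrounder(nn.Module):
    """Illustrative compositional grounding: entity, attribute, and color modules
    each score candidate regions; the combined score picks the grounded box."""
    def __init__(self, text_dim=300, region_dim=2048, hid_dim=256):
        super().__init__()
        def make_module():
            return nn.ModuleDict({
                "txt": nn.Linear(text_dim, hid_dim),
                "img": nn.Linear(region_dim, hid_dim),
            })
        self.modules_by_level = nn.ModuleDict({
            "entity": make_module(), "attribute": make_module(), "color": make_module()
        })

    def score(self, level, text_emb, region_feats):
        m = self.modules_by_level[level]
        t = F.normalize(m["txt"](text_emb), dim=-1)          # (B, H)
        r = F.normalize(m["img"](region_feats), dim=-1)      # (B, R, H)
        return torch.bmm(r, t.unsqueeze(2)).squeeze(2)       # cosine-style scores (B, R)

    def forward(self, parts, region_feats):
        # parts: dict mapping "entity"/"attribute"/"color" -> (B, text_dim) phrase-part embedding
        total = sum(self.score(level, emb, region_feats) for level, emb in parts.items())
        return F.softmax(total, dim=-1)                       # distribution over candidate regions
```

Because each level is scored by its own small module, an unused or counterfactual part of the description (e.g. a color that matches no region) degrades only its own score rather than the whole prediction, which is the intuition behind the modular design.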
This paper introduces a new dataset for video object search with referring expressions that includes numerous copies of the objects, making it difficult to use non-relational expressions, and proposes a deep attention network that significantly outperforms the baselines on this dataset.
A language-guided graph representation is proposed to capture the global context of grounding entities and their relations, and a cross-modal graph matching strategy for the multiple-phrase visual grounding task is developed.
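A small sketch of the cross-modal graph-matching idea with simplified, assumed inputs (phrase node features on the language side, region node features on the visual side, and pairwise "edges" built by concatenating endpoint features); names and the scoring scheme are illustrative: compute node-to-node and edge-to-edge similarities and combine them into per-phrase matching scores for multiple-phrase grounding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphMatcher(nn.Module):
    """Illustrative cross-modal graph matching: phrases and their relations on one
    side, regions and their pairwise relations on the other (not the paper's code)."""
    def __init__(self, txt_dim=300, img_dim=2048, hid_dim=256):
        super().__init__()
        self.txt_node = nn.Linear(txt_dim, hid_dim)
        self.img_node = nn.Linear(img_dim, hid_dim)
        self.txt_edge = nn.Linear(2 * txt_dim, hid_dim)
        self.img_edge = nn.Linear(2 * img_dim, hid_dim)

    def forward(self, phrase_feats, region_feats):
        # phrase_feats: (P, txt_dim), region_feats: (R, img_dim) for one image-sentence pair
        tn = F.normalize(self.txt_node(phrase_feats), dim=-1)          # (P, H)
        vn = F.normalize(self.img_node(region_feats), dim=-1)          # (R, H)
        node_sim = tn @ vn.t()                                         # (P, R) node-level match

        # Pairwise "edges": concatenate the two endpoint features of every pair.
        P, R = phrase_feats.size(0), region_feats.size(0)
        te = self.txt_edge(torch.cat([phrase_feats.unsqueeze(1).expand(P, P, -1),
                                      phrase_feats.unsqueeze(0).expand(P, P, -1)], dim=-1))
        ve = self.img_edge(torch.cat([region_feats.unsqueeze(1).expand(R, R, -1),
                                      region_feats.unsqueeze(0).expand(R, R, -1)], dim=-1))
        edge_sim = torch.einsum("pqh,rsh->pqrs",
                                F.normalize(te, dim=-1), F.normalize(ve, dim=-1))

        # A phrase-region pair scores well if its nodes match and, on average, its
        # relations to the other phrases match the relations among regions.
        relational = edge_sim.mean(dim=(1, 3))                          # (P, R)
        return F.softmax(node_sim + relational, dim=-1)                 # per-phrase region distribution
```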
This work focuses on the OneCommon Corpus, a simple yet challenging common grounding dataset which contains minimal bias by design, and provides comprehensive and reliable annotation for 600 dialogues, showing that the annotation captures important linguistic structures including predicate-argument structure, modification, and ellipsis.
ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld and then execute goals from the ALFRED benchmark in a rich visual environment, enables the creation of a new BUTLER agent whose abstract knowledge corresponds directly to concrete, visually grounded actions.