3260 papers • 126 benchmarks • 313 datasets
This work provides the first benchmark dataset for visually grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.
This work introduces the Touchdown task and dataset, in which an agent must first follow navigation instructions in a Street View environment to reach a goal position, and then resolve a natural language description of its observed environment to find a hidden object.
The Touchdown dataset provides human-annotated instructions for navigating New York City streets and for resolving spatial descriptions at a given location; this work publicly releases the 29k raw Street View panoramas needed for Touchdown.
It is shown that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, achieving competitive or better results on diverse V&L tasks and establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, along with two modules that, incorporated into an end-to-end architecture, significantly outperform current state-of-the-art methods that use greedy action selection.
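The idea of using a learned progress monitor as a search heuristic can be illustrated with a minimal sketch: best-first search over a navigation graph that always expands the node with the highest estimated progress toward instruction completion. The graph, node names, and progress scores below are illustrative stand-ins, not values from the paper.

```python
import heapq

# Hypothetical progress scores per node, standing in for the output of a
# learned progress monitor (illustrative values, not from the paper).
PROGRESS = {"start": 0.0, "hall": 0.4, "kitchen": 0.2, "bedroom": 0.7, "goal": 1.0}

# Toy navigation graph: node -> reachable neighbors.
GRAPH = {
    "start": ["hall", "kitchen"],
    "hall": ["bedroom"],
    "kitchen": [],
    "bedroom": ["goal"],
    "goal": [],
}

def heuristic_search(start, goal):
    """Best-first search that expands the frontier node with the highest
    estimated progress score; returns the path to the goal, or None."""
    frontier = [(-PROGRESS[start], start, [start])]  # max-heap via negation
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in GRAPH[node]:
            heapq.heappush(frontier, (-PROGRESS[nxt], nxt, path + [nxt]))
    return None
```

Because the frontier persists across expansions, the agent can backtrack past dead ends (here, "kitchen") rather than committing greedily to one trajectory.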
A language-guided navigation task set in a continuous 3D environment, where agents must execute low-level actions to follow natural language navigation directions, is developed; results suggest that performance in prior 'navigation-graph' settings may be inflated by their strong implicit assumptions.
The size, scope and detail of Room-Across-Room (RxR) dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.
A self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module that locates the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor that ensures the grounded instruction correctly reflects the navigation progress.
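The two components above can be sketched in a few lines of numpy: attention over instruction tokens and over panoramic views (co-grounding), plus a toy progress estimate derived from where the textual attention sits in the instruction. All features here are random stand-ins for learned encodings, and the progress formula is a simplified illustration, not the paper's learned monitor.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def co_ground(instr_feats, view_feats, state):
    """Co-grounding sketch: attend over instruction tokens (which words
    matter now) and over panoramic view features (where to move next),
    conditioned on the agent's state vector."""
    text_attn = softmax(instr_feats @ state)
    view_attn = softmax(view_feats @ state)
    return text_attn, view_attn

def progress_estimate(text_attn):
    """Toy progress monitor: the centroid of the textual attention,
    normalized by instruction length, estimates the fraction of the
    instruction already completed (always in [0, 1])."""
    positions = np.arange(len(text_attn))
    return float((text_attn * positions).sum() / (len(text_attn) - 1))
```

In the actual agent both attention distributions and the progress signal are trained jointly; this sketch only shows how the pieces fit together.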
In this work, BnB is introduced, a large-scale and diverse in-domain VLN dataset used to pretrain the Airbert model, which can be adapted to discriminative and generative settings and outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks.
A novel, planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task; it significantly outperforms the baselines and achieves the best results on the real-world Room-to-Room dataset.
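A hybrid of model-free and model-based RL can be sketched as follows: score each candidate action by blending the model-free policy's preference with the return of a short look-ahead rollout in a learned dynamics model. The deterministic toy model, states, and rewards below are illustrative assumptions, not the paper's learned components.

```python
# Toy deterministic "learned" dynamics model standing in for the paper's
# environment model: state -> action -> (next_state, reward).
MODEL = {
    0: {0: (1, 0.1), 1: (2, 0.0)},
    1: {0: (3, 0.5), 1: (0, 0.0)},
    2: {0: (0, 0.0), 1: (3, 0.2)},
    3: {0: (3, 0.0), 1: (3, 0.0)},
}

def rollout_return(state, action, depth):
    """Simulate `depth` steps in the model, taking `action` first and then
    acting greedily on model reward; return the accumulated reward."""
    total = 0.0
    for step in range(depth):
        if step == 0:
            nxt, r = MODEL[state][action]
        else:
            nxt, r = max(MODEL[state].values(), key=lambda sr: sr[1])
        total += r
        state = nxt
    return total

def choose_action(state, policy_logits, depth=2, alpha=0.5):
    """Blend the model-free policy preference (logits) with the
    model-based look-ahead return, and pick the best action."""
    scores = [
        alpha * logit + (1 - alpha) * rollout_return(state, a, depth)
        for a, logit in enumerate(policy_logits)
    ]
    return scores.index(max(scores))
```

With a uniform policy, the look-ahead term breaks the tie: from state 0, action 0 leads to the higher simulated return, so it is selected even though both actions look equally good to the model-free policy alone.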