3260 papers • 126 benchmarks • 313 datasets
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. (Image credit: Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout)
These leaderboards are used to track progress in vision-language navigation.
Use these libraries to find vision-language navigation models and implementations.
This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, and proposes two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods using greedy action selection.

A self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the navigation progress.
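The progress-monitor idea above can be illustrated with a minimal sketch: if the agent attends over instruction tokens at each step, the attention-weighted mean of normalized token positions gives a crude estimate of how far through the instruction it is. This is an illustrative heuristic only; the function name and interface are not from the paper.

```python
def soft_progress(attn_weights):
    """Estimate navigation progress from attention over instruction tokens.

    Returns the attention-weighted mean of normalized token positions
    (0.0 = attending to the start of the instruction, 1.0 = the end).
    Illustrative sketch, not the paper's trained progress monitor.
    """
    n = len(attn_weights)
    positions = [i / (n - 1) for i in range(n)]  # normalize token index to [0, 1]
    total = sum(attn_weights)
    return sum(w * p for w, p in zip(attn_weights, positions)) / total
```

For example, attention concentrated on the last token yields a progress estimate of 1.0, while uniform attention yields 0.5.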
A general cross-lingual VLN framework is proposed to enable instruction-following navigation in different languages, and an adversarial domain adaptation loss is introduced to improve the model's transfer ability when given a certain amount of target-language data.
A novel, planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task, significantly outperforming the baselines and achieving the best results on the real-world Room-to-Room dataset.
The Frontier Aware Search with backTracking (FAST) Navigator is presented, a general framework for action decoding that achieves state-of-the-art results on the 2018 Room-to-Room (R2R) Vision-and-Language Navigation challenge.
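The core mechanism behind frontier-aware search with backtracking can be sketched as best-first search over a global frontier of partial trajectories, scored by a learned heuristic: the agent always expands the highest-scoring frontier node, so backtracking to an earlier branch falls out naturally. The function names and interfaces below are assumptions for illustration, not the paper's actual code.

```python
import heapq

def frontier_search(start, expand, score, is_goal, max_steps=100):
    """Best-first search over a frontier of partial trajectories.

    `expand(node)` yields successor nodes; `score(node)` is a learned
    heuristic where higher is better; `is_goal(node)` tests termination.
    Because the best-scoring frontier node is expanded regardless of
    where the last expansion happened, backtracking is implicit.
    Illustrative sketch only.
    """
    frontier = [(-score(start), 0, start)]  # min-heap on negated score
    tie = 1  # tiebreaker so the heap never compares nodes directly
    for _ in range(max_steps):
        if not frontier:
            return None
        _, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node
        for nxt in expand(node):
            heapq.heappush(frontier, (-score(nxt), tie, nxt))
            tie += 1
    return None
```

As a toy usage example, searching the integers with steps of +1 and +2 toward a target of 5, scored by negative distance to the target, returns 5.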
This paper presents a generalizable navigational agent, trained in two stages via mixed imitation and reinforcement learning, outperforming the state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task, and achieving the top rank on the leaderboard.
This work introduces a multitask navigation model that can be seamlessly trained on both Vision-Language Navigation and Navigation from Dialog History tasks, and proposes to learn environment-agnostic representations for the navigation policy that are invariant among the environments seen during training, thus generalizing better on unseen environments.
This work proposes an end-to-end framework for learning an exploration policy that decides i) when and where to explore, ii) what information is worth gathering during exploration, and iii) how to adjust the navigation decision after the exploration.
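The exploration policy described above can be caricatured as a confidence-gated decision loop: commit to the current best action when the policy is confident, otherwise gather extra observations and re-score before acting. Everything here — the threshold, the `explore_fn` interface — is a hypothetical sketch, not the paper's learned policy.

```python
def navigate_step(action_probs, explore_fn, threshold=0.5):
    """Decide an action, exploring first if the policy is uncertain.

    If the top action's probability is below `threshold`, call
    `explore_fn(action_probs)` to obtain re-scored probabilities
    (e.g., after peeking down candidate directions), then commit.
    Hypothetical interface for illustration.
    """
    best = max(range(len(action_probs)), key=action_probs.__getitem__)
    if action_probs[best] >= threshold:
        return best  # confident: act without exploring
    adjusted = explore_fn(action_probs)  # uncertain: gather info, re-score
    return max(range(len(adjusted)), key=adjusted.__getitem__)
```

The threshold trades off path efficiency against success rate: exploring more often lengthens trajectories but reduces wrong turns.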
This paper proposes a modular approach to deal with the combined navigation and object interaction problem without the need for strictly aligned vision and language training data, and proposes a novel geometry-aware mapping technique for cluttered indoor environments, and a language understanding model generalized for household instruction following.