3260 papers • 126 benchmarks • 313 datasets
This work provides the first benchmark dataset for visually grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.
This work introduces the Touchdown task and dataset, in which an agent must first follow navigation instructions in a Street View environment to reach a goal position, and then resolve a natural language description of its observed environment to find a hidden object.
The Touchdown dataset provides human-annotated instructions for navigating New York City streets and for resolving spatial descriptions at a given location; this work publicly releases the 29k raw Street View panoramas needed for Touchdown.
It is shown that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, achieving competitive or better results on diverse V&L tasks and establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, along with two modules that, incorporated into an end-to-end architecture, significantly outperform current state-of-the-art methods that use greedy action selection.
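The idea of using a learned progress monitor as a search heuristic can be illustrated with a minimal sketch: best-first search over a navigation graph that always expands the node with the highest estimated progress toward instruction completion. The graph, node names, and progress scores below are illustrative stand-ins, not values from the paper.

```python
import heapq

# Hypothetical progress scores per node, standing in for the output of a
# learned progress monitor (illustrative values, not from the paper).
PROGRESS = {"start": 0.0, "hall": 0.4, "kitchen": 0.2, "bedroom": 0.7, "goal": 1.0}

# Toy navigation graph: node -> reachable neighbors.
GRAPH = {
    "start": ["hall", "kitchen"],
    "hall": ["bedroom"],
    "kitchen": [],
    "bedroom": ["goal"],
    "goal": [],
}

def heuristic_search(start, goal):
    """Best-first search that expands the frontier node with the highest
    estimated progress score; returns the path to the goal, or None."""
    frontier = [(-PROGRESS[start], start, [start])]  # max-heap via negation
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in GRAPH[node]:
            heapq.heappush(frontier, (-PROGRESS[nxt], nxt, path + [nxt]))
    return None
```

Because the frontier persists across expansions, the agent can backtrack past dead ends (here, "kitchen") rather than committing greedily to one trajectory.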
A language-guided navigation task set in a continuous 3D environment, where agents must execute low-level actions to follow natural language navigation directions, is developed; results suggest that performance in prior 'navigation-graph' settings may be inflated by their strong implicit assumptions.
The size, scope and detail of Room-Across-Room (RxR) dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.
A self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module that locates the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor that ensures the grounded instruction correctly reflects the navigation progress.
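The two components above can be sketched in a few lines of numpy: attention over instruction tokens and over panoramic views (co-grounding), plus a toy progress estimate derived from where the textual attention sits in the instruction. All features here are random stand-ins for learned encodings, and the progress formula is a simplified illustration, not the paper's learned monitor.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def co_ground(instr_feats, view_feats, state):
    """Co-grounding sketch: attend over instruction tokens (which words
    matter now) and over panoramic view features (where to move next),
    conditioned on the agent's state vector."""
    text_attn = softmax(instr_feats @ state)
    view_attn = softmax(view_feats @ state)
    return text_attn, view_attn

def progress_estimate(text_attn):
    """Toy progress monitor: the centroid of the textual attention,
    normalized by instruction length, estimates the fraction of the
    instruction already completed (always in [0, 1])."""
    positions = np.arange(len(text_attn))
    return float((text_attn * positions).sum() / (len(text_attn) - 1))
```

In the actual agent both attention distributions and the progress signal are trained jointly; this sketch only shows how the pieces fit together.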
In this work, BnB is introduced, a large-scale and diverse in-domain VLN dataset used to pretrain the Airbert model, which can be adapted to discriminative and generative settings and outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks.
A novel, planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task; it significantly outperforms the baselines and achieves the best results on the real-world Room-to-Room dataset.
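A hybrid of model-free and model-based RL can be sketched as follows: score each candidate action by blending the model-free policy's preference with the return of a short look-ahead rollout in a learned dynamics model. The deterministic toy model, states, and rewards below are illustrative assumptions, not the paper's learned components.

```python
# Toy deterministic "learned" dynamics model standing in for the paper's
# environment model: state -> action -> (next_state, reward).
MODEL = {
    0: {0: (1, 0.1), 1: (2, 0.0)},
    1: {0: (3, 0.5), 1: (0, 0.0)},
    2: {0: (0, 0.0), 1: (3, 0.2)},
    3: {0: (3, 0.0), 1: (3, 0.0)},
}

def rollout_return(state, action, depth):
    """Simulate `depth` steps in the model, taking `action` first and then
    acting greedily on model reward; return the accumulated reward."""
    total = 0.0
    for step in range(depth):
        if step == 0:
            nxt, r = MODEL[state][action]
        else:
            nxt, r = max(MODEL[state].values(), key=lambda sr: sr[1])
        total += r
        state = nxt
    return total

def choose_action(state, policy_logits, depth=2, alpha=0.5):
    """Blend the model-free policy preference (logits) with the
    model-based look-ahead return, and pick the best action."""
    scores = [
        alpha * logit + (1 - alpha) * rollout_return(state, a, depth)
        for a, logit in enumerate(policy_logits)
    ]
    return scores.index(max(scores))
```

With a uniform policy, the look-ahead term breaks the tie: from state 0, action 0 leads to the higher simulated return, so it is selected even though both actions look equally good to the model-free policy alone.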