Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments (2017-11-20)

TL;DR

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

Abstract

A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings: the Room-to-Room (R2R) dataset.

Authors

Anton van den Hengel

Niko Sünderhauf

I. Reid

References (65 items)

Gibson Env: Real-World Perception for Embodied Agents

CHALET: Cornell House Agent Learning Environment

Building Generalizable Agents with a Realistic and Rich 3D Environment

AI2-THOR: An Interactive 3D Environment for Visual AI

MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments

IQA: Visual Question Answering in Interactive Environments

Embodied Question Answering

HoME: a Household Multimodal Environment

Matterport3D: Learning from RGB-D Data in Indoor Environments

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Gated-Attention Architectures for Task-Oriented Language Grounding

ParlAI: A Dialog Research Software Platform

Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

A dataset for developing and benchmarking active vision

ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

Cognitive Mapping and Planning for Visual Navigation

Joint 2D-3D-Semantic Data for Indoor Scene Understanding

Listen

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Self-Critical Sequence Training for Image Captioning

Visual Dialog

Learning to Navigate in Complex Environments

Towards Cognitive Exploration through Deep Reinforcement Learning for Mobile Robots

Professor Forcing: A New Algorithm for Training Recurrent Networks

Target-driven visual navigation in indoor scenes using deep reinforcement learning

Deep Successor Reinforcement Learning

OpenAI Gym

Hierarchical Question-Image Co-Attention for Visual Question Answering

ViZDoom: A Doom-based AI research platform for visual reinforcement learning

A Deep Hierarchical Approach to Lifelong Learning in Minecraft

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Deep Residual Learning for Image Recognition

MovieQA: Understanding Stories in Movies through Question-Answering

Generation and Comprehension of Unambiguous Object Descriptions

Stacked Attention Networks for Image Question Answering

Effective Approaches to Attention-based Neural Machine Translation

Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

SUN RGB-D: A RGB-D scene understanding benchmark suite

VQA: Visual Question Answering

Microsoft COCO Captions: Data Collection and Evaluation Server

Adam: A Method for Stochastic Optimization

ReferItGame: Referring to Objects in Photographs of Natural Scenes

Sequence to Sequence Learning with Neural Networks

Neural Machine Translation by Jointly Learning to Align and Translate

ImageNet Large Scale Visual Recognition Challenge

Caffe: Convolutional Architecture for Fast Feature Embedding

Grounding spatial relations for human-robot interaction

Indoor Segmentation and Support Inference from RGBD Images

Learning to Interpret Natural Language Navigation Instructions from Observations

Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation

Unbiased look at dataset bias

Natural language command of an autonomous micro-air vehicle

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Learning to Follow Navigational Directions

Toward understanding natural language directions

Et al

Solving Deep Memory POMDPs with Recurrent Policy Gradients

Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions

Long Short-Term Memory

Procedures As A Representation For Data In A Computer Program For Understanding Natural Language

Gershman

Derek Hoiem and R

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

Design of Everyday Things
