1. PI-QT-Opt: Predictive Information Improves Multi-Task Robotic Reinforcement Learning at Scale
2. Inner Monologue: Embodied Reasoning through Planning with Language Models
3. PaLM: Scaling Language Modeling with Pathways
4. R3M: A Universal Visual Representation for Robot Manipulation
5. Training language models to follow instructions with human feedback
6. A data-driven approach for learning to control computers
7. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning
8. Pre-Trained Language Models for Interactive Decision-Making
9. Chain of Thought Prompting Elicits Reasoning in Large Language Models
10. Can Wikipedia Help Offline Reinforcement Learning?
11. LaMDA: Language Models for Dialog Applications
12. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
13. Scaling Language Models: Methods, Analysis & Insights from Training Gopher
14. Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning
15. Skill Induction and Planning with Latent Language
16. CLIPort: What and Where Pathways for Robotic Manipulation
17. Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks
18. Finetuned Language Models Are Zero-Shot Learners
19. Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation
20. On the Opportunities and Risks of Foundation Models
21. Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
22. BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments
23. Language Grounding with 3D Objects
24. A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
25. MERLOT: Multimodal Neural Script Knowledge Models
26. PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World
27. ReLMoGen: Integrating Motion Generation in Reinforcement Learning for Mobile Manipulation
28. Episodic Transformer for Vision-and-Language Navigation
29. Understanding by Understanding Not: Modeling Negation in Language Models
30. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
31. MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale
32. Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills
33. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
34. Learning Transferable Visual Models From Natural Language Supervision
35. RetinaGAN: An Object-aware Approach to Sim-to-Real Transfer
36. Language-Conditioned Imitation Learning for Robot Manipulation Tasks
37. Broadly-Exploring, Local-Policy Trees for Long-Horizon Task Planning
38. PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards
39. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
40. Language Models are Few-Shot Learners
41. Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text
42. Grounding Language in Play
43. Robots That Use Language
44. Thinking While Moving: Deep Reinforcement Learning with Concurrent Control
45. Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog
46. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
47. Online Replanning in Belief Space for Partially Observable Task and Motion Problems
48. HRL4IN: Hierarchical Reinforcement Learning for Interactive Navigation with Mobile Manipulators
49. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
50. Self-Educated Language Agent with Hindsight Experience Replay for Instruction Following
51. Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation
52. Regression Planning Networks
53. VisualBERT: A Simple and Performant Baseline for Vision and Language
54. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
55. Language as an Abstraction for Hierarchical Deep Reinforcement Learning
56. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
57. A Survey of Reinforcement Learning Informed by Natural Language
58. VideoBERT: A Joint Model for Video and Language Representation Learning
60. Neural Task Graphs: Generalizing to Unseen Tasks From a Single Video Demonstration
61. Differentiable Physics and Stable Modes for Tool-Use and Manipulation Planning
62. Universal Sentence Encoder
63. Semi-parametric Topological Memory for Navigation
64. Guiding Exploratory Behaviors for Multi-Modal Grounding of Linguistic Descriptions
65. Neural Task Programming: Learning to Generalize Across Hierarchical Tasks
66. Grounded Language Learning in a Simulated 3D World
67. Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning
68. Attention is All you Need
69. Visual Semantic Planning Using Deep Successor Representations
70. Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
71. Modular Multitask Reinforcement Learning with Policy Sketches
72. Learning Language Games through Interaction
73. Prioritized Experience Replay
74. Logic-Geometric Programming: An Optimization-Based Approach to Combined Task and Motion Planning
75. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
76. Combined task and motion planning through an extensible planner-independent interface layer
77. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions
78. Asking for Help Using Inverse Semantics
79. A Joint Model of Language and Perception for Grounded Attribute Learning
80. RoboFrameNet: Verb-centric semantics for actions in robot middleware
81. Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation
82. Hierarchical Planning in the Now
83. Toward understanding natural language directions
84. The theory of affordances
85. Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions
87. SHOP: Simple Hierarchical Ordered Planner
88. Grounding language in perception
89. The Ecological Approach to Visual Perception
90. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving
91. Inventing Relational State and Action Abstractions for Effective and Efficient Bilevel Planning
92. Grounding Language to Autonomously-Acquired Skills via Goal Generation
93. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
94. TF-Agents: A library for Reinforcement Learning in TensorFlow
96. A Structure for Plans and Behavior
97. Understanding natural language
98
Robot: 1. find a water bottle, 2. pick up the water bottle, 3. go to the counter, 4. put down the water bottle, 5. done
99
Move the grapefruit drink from the table to the close counter
100
How would you put the grapes in the bowl and then move the cheese to the table?
101
Explanation: The user has asked for snacks, I will bring jalapeno chips and an apple
102
Robot: 1. put down the banana, 2. done
103
Explanation: The user has asked me to move the grapefruit drink to the counter