Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog (2019-06-30)

TL;DR

This work develops a novel class of off-policy batch RL algorithms that learn effectively offline, without exploration, from a fixed batch of human interaction data. The algorithms use models pre-trained on data as a strong prior and apply KL-control to penalize divergence from this prior during RL training.

Abstract

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.
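The two mechanisms named in the abstract, KL-control against a pre-trained prior and a dropout-based lower bound on target Q-values, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `kl_weight` parameter, the use of a single-sample KL estimate as a reward penalty, and the choice of the minimum over stochastic forward passes as the lower bound are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math


def kl_divergence(policy_probs, prior_probs):
    """Exact KL(policy || prior) for two discrete action distributions."""
    return sum(q * math.log(q / p)
               for q, p in zip(policy_probs, prior_probs) if q > 0.0)


def kl_penalized_reward(reward, logp_prior, logp_policy, kl_weight=1.0):
    """Task reward minus a penalty for diverging from the prior.

    For an action sampled from the policy, log pi(a|s) - log p(a|s) is a
    single-sample estimate of KL(policy || prior). kl_weight is a
    hypothetical trade-off coefficient, not a value from the paper.
    """
    return reward - kl_weight * (logp_policy - logp_prior)


def dropout_lower_bound_target(reward, next_q_fn, num_passes,
                               gamma=0.99, done=False):
    """Bootstrapped Q-target using a pessimistic next-state value.

    next_q_fn is a hypothetical callable that returns one estimate of the
    next-state value from the target network with dropout left on. Taking
    the minimum over num_passes calls is an assumed way to form the lower
    bound; it needs one network rather than the two used in Double
    Q-Learning.
    """
    if done:
        return reward
    samples = [next_q_fn() for _ in range(num_passes)]
    return reward + gamma * min(samples)


if __name__ == "__main__":
    # Identical distributions incur no penalty; a shifted policy does.
    print(kl_divergence([0.5, 0.5], [0.5, 0.5]))
    print(kl_divergence([0.9, 0.1], [0.5, 0.5]))

    # Fixed stand-ins for three dropout-perturbed forward passes.
    passes = iter([2.0, 1.0, 3.0])
    print(dropout_lower_bound_target(0.5, lambda: next(passes), 3, gamma=0.5))
```

Under these assumptions, an action that the policy prefers far more strongly than the prior receives a reduced reward, which keeps the learned policy close to the pre-trained model, while the minimum over dropout samples makes the bootstrapped target pessimistic where the target network is uncertain.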

Authors

Àgata Lapedriza

S. Gu

Asma Ghandeharioun
