[1] MOReL: Model-Based Offline Reinforcement Learning
[2] D4RL: Datasets for Deep Data-Driven Reinforcement Learning
[3] DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
[4] Interference and Generalization in Temporal Difference Learning
[5] Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
[6] AlgaeDICE: Policy Gradient from Arbitrary Experience
[7] RoboNet: Large-Scale Multi-Robot Learning
[8] Improved Sample Complexities for Deep Networks and Robust Classification via an All-Layer Margin
[9] Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
[10] Behavior Regularized Offline Reinforcement Learning
[11] Deep Dynamics Models for Learning Dexterous Manipulation
[12] Striving for Simplicity in Off-policy Deep Reinforcement Learning
[13] Benchmarking Model-Based Reinforcement Learning
[14] Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
[15] Exploring Model-based Planning with Policy Networks
[16] When to Trust Your Model: Model-Based Policy Optimization
[17] Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
[18] Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
[19] Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation
[20] Model-Based Reinforcement Learning for Atari
[21] Guidelines for reinforcement learning in healthcare
[22] Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
[23] Off-Policy Deep Reinforcement Learning without Exploration
[24] Quantifying Generalization in Reinforcement Learning
[25] Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control
[26] Model-Based Reinforcement Learning via Meta-Policy Optimization
[27] Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees
[28] Accurate Uncertainties for Deep Learning Using Calibrated Regression
[29] A Dissection of Overfitting and Generalization in Continuous Reinforcement Learning
[30] The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces
[31] Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
[32] BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling
[34] Addressing Function Approximation Error in Actor-Critic Methods
[35] Spectral Normalization for Generative Adversarial Networks
[36] Model-Ensemble Trust-Region Policy Optimization
[37] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
[38] Proximal Policy Optimization Algorithms
[39] Imagination-Augmented Agents for Deep Reinforcement Learning
[40] Value Prediction Network
[41] Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
[42] The Predictron: End-To-End Learning and Planning
[43] Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
[44] Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
[45] Deep visual foresight for planning robot motion
[46] SQuAD: 100,000+ Questions for Machine Comprehension of Text
[47] Safe and Efficient Off-Policy Reinforcement Learning
[48] Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks
[49] Optimal control with learned local models: Application to dexterous manipulation
[50] Value Iteration Networks
[51] Asynchronous Methods for Deep Reinforcement Learning
[52] Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
[53] Continuous control with deep reinforcement learning
[54] Trust Region Policy Optimization
[55] Wiener's Polynomial Chaos for the Analysis and Control of Nonlinear Dynamical Systems with Probabilistic Uncertainties [Historical Perspectives]
[57] MuJoCo: A physics engine for model-based control
[58] Linear Off-Policy Actor-Critic
[59] PILCO: A Model-Based and Data-Efficient Approach to Policy Search
[60] Scalable Approach to Uncertainty Quantification and Robust Design of Interconnected Dynamical Systems
[61] ImageNet: A large-scale hierarchical image database
[62] On integral probability metrics, φ-divergences and binary classification
[63] An analysis of model-based Interval Estimation for Markov Decision Processes
[64] Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping
[65] A Kernel Approach to Comparing Distributions
[66] Off-Policy Temporal Difference Learning with Function Approximation
[67] Integral Probability Metrics and Their Generating Classes of Functions
[68] Model Predictive Control Using Neural Networks [25 Years Ago]
[69] Dyna, an integrated architecture for learning, planning, and reacting
[70] Neuronlike adaptive elements that can solve difficult learning control problems
[71] Some Asymptotic Theory for the Bootstrap
[73] Improving PILCO with Bayesian Neural Network Dynamics Models
[74] Safe Reinforcement Learning
[75] Batch Reinforcement Learning
[76] Multi-Step Dyna Planning for Policy Evaluation and Control