Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Published in

International Conference on Machine Learning (2018)

TL;DR

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. The method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.

Abstract

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
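
To make the phrase "maximize expected reward while also maximizing entropy" concrete, here is a minimal sketch in Python for a single decision with three discrete actions. It is an illustration under stated assumptions, not the paper's algorithm, which uses neural networks and continuous actions. The names `soft_objective`, `softmax_policy`, `q_values`, and the temperature `alpha` are hypothetical and chosen for this example. The sketch compares a greedy policy, a uniform policy, and a policy proportional to exp(Q / alpha) on the entropy-augmented objective.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_objective(probs, q_values, alpha):
    """Expected value plus alpha times policy entropy (one-step case)."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)

def softmax_policy(q_values, alpha):
    """Policy proportional to exp(Q / alpha), computed stably."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical action values and temperature, chosen only for illustration.
q_values = [1.0, 0.5, -0.2]
alpha = 0.5

greedy = [1.0, 0.0, 0.0]           # always picks the best action, zero entropy
uniform = [1 / 3, 1 / 3, 1 / 3]    # maximum entropy, ignores the values
soft = softmax_policy(q_values, alpha)

for name, policy in [("greedy", greedy), ("uniform", uniform), ("soft", soft)]:
    print(name, round(soft_objective(policy, q_values, alpha), 4))
```

In this one-step setting the softmax policy scores highest of the three, since it trades some expected value for randomness. As `alpha` shrinks toward zero it approaches the greedy policy, and as `alpha` grows it approaches the uniform one.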

Authors

Tuomas Haarnoja

5 papers

Aurick Zhou

5 papers

P. Abbeel

35 papers

References (42 items)

Addressing Function Approximation Error in Actor-Critic Methods

Deep Reinforcement Learning that Matters

Proximal Policy Optimization Algorithms

Trust-PCL: An Off-Policy Trust Region Method for Continuous Control

Equivalence Between Policy Gradients and Soft Q-Learning

The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning

The Reactor: A Sample-Efficient Actor-Critic Architecture

Data-efficient Deep Reinforcement Learning for Dexterous Manipulation

Bridging the Gap Between Value and Policy Based Reinforcement Learning

Reinforcement Learning with Deep Energy-Based Policies

Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

PGQ: Combining policy gradient and Q-learning

Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates

Benchmarking Deep Reinforcement Learning for Continuous Control

Asynchronous Methods for Deep Reinforcement Learning

Mastering the game of Go with deep neural networks and tree search

Taming the Noise in Reinforcement Learning via Soft Updates

Learning Continuous Control Policies by Stochastic Value Gradients

Continuous control with deep reinforcement learning

End-to-End Training of Deep Visuomotor Policies

Human-level control through deep reinforcement learning

Trust Region Policy Optimization

Adam: A Method for Stochastic Optimization

Bias in Natural Actor-Critic Algorithms

Deterministic Policy Gradient Algorithms

Playing Atari with Deep Reinforcement Learning

Guided Policy Search

On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference (Extended Abstract)

Double Q-learning

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Robot trajectory optimization using approximate inference

General duality between optimal control and estimation

Maximum Entropy Inverse Reinforcement Learning

Reinforcement learning of motor skills with policy gradients

Compact Spectral Bases for Value Function Approximation Using Kronecker Factorization

Neuronlike adaptive elements that can solve difficult learning control problems

On the Theory of the Brownian Motion

Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

Reinforcement Learning: An Introduction