Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess [1] and Go [2], where a perfect simulator is available. However, in real-world problems, the dynamics governing the environment are often complex and unknown. Here we present the MuZero algorithm, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. The MuZero algorithm learns a model that, when applied iteratively, produces the predictions most relevant to planning: the action-selection policy, the value function and the reward. When evaluated on 57 different Atari games [3], the canonical video game environment for testing artificial intelligence techniques, in which model-based planning approaches have historically struggled [4], the MuZero algorithm achieved state-of-the-art performance. When evaluated on Go, chess and shogi, canonical environments for high-performance planning, the MuZero algorithm matched, without any knowledge of the game dynamics, the superhuman performance of the AlphaZero algorithm [5], which was supplied with the rules of the game.
Authors: Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver