Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning (2021-09-23)

TL;DR

Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, therefore establishing a new state of the art in multi-agent reinforcement learning.

Abstract

Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply; this is because agents, even in cooperative games, could have conflicting directions of policy updates. As a result, achieving a guaranteed improvement on the joint policy where each agent acts individually remains an open challenge. In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike many existing MARL algorithms, HATRPO/HAPPO do not need agents to share parameters, nor do they need any restrictive assumptions on decomposability of the joint value function. Most importantly, we justify in theory the monotonic improvement property of HATRPO/HAPPO. We evaluate the proposed methods on a series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, therefore establishing a new state of the art.
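
The abstract names the multi-agent advantage decomposition lemma without stating it. As a rough illustration only, the sketch below assumes the usual telescoping form of such a lemma: for any ordering of the agents, the joint advantage of a joint action equals the sum of each agent's advantage conditioned on the actions already chosen by the agents before it. The single-state setup, the random Q-table, and all function names (marginal_q, advantage, decomposed_advantage) are hypothetical and are not taken from the paper.

```python
import itertools
import random


def marginal_q(q, policies, fixed):
    """Expected Q-value when the agents in `fixed` play the given actions
    and every other agent samples from its own policy.

    q        : dict mapping a joint action tuple to a value (single state)
    policies : list of per-agent action probability lists
    fixed    : dict mapping agent index -> chosen action
    """
    n_agents = len(policies)
    free = [i for i in range(n_agents) if i not in fixed]
    total = 0.0
    for free_actions in itertools.product(*(range(len(policies[i])) for i in free)):
        joint = dict(fixed)
        joint.update(zip(free, free_actions))
        prob = 1.0
        for i, a in zip(free, free_actions):
            prob *= policies[i][a]
        total += prob * q[tuple(joint[i] for i in range(n_agents))]
    return total


def advantage(q, policies, context, new):
    """Advantage of the agents in `new` given the actions in `context`:
    Q over (context + new) minus Q over context alone."""
    return marginal_q(q, policies, {**context, **new}) - marginal_q(q, policies, context)


def decomposed_advantage(q, policies, order, joint_action):
    """Sum of per-agent advantages, visiting agents in `order`; each agent
    conditions on the actions of the agents visited before it."""
    total = 0.0
    context = {}
    for i in order:
        total += advantage(q, policies, context, {i: joint_action[i]})
        context[i] = joint_action[i]
    return total


def random_policy(rng, n_actions):
    """Hypothetical helper: a random strictly positive distribution."""
    weights = [rng.random() + 0.1 for _ in range(n_actions)]
    norm = sum(weights)
    return [w / norm for w in weights]


rng = random.Random(0)
n_actions = [2, 3, 2]  # three heterogeneous agents with different action sets
policies = [random_policy(rng, n) for n in n_actions]
q = {a: rng.uniform(-1.0, 1.0)
     for a in itertools.product(*(range(n) for n in n_actions))}

joint_action = (1, 2, 0)
joint_adv = advantage(q, policies, {}, dict(enumerate(joint_action)))

# The decomposition should match the joint advantage for every agent ordering.
for order in itertools.permutations(range(len(policies))):
    assert abs(decomposed_advantage(q, policies, order, joint_action) - joint_adv) < 1e-9
```

Because the identity holds for every ordering and needs no structure on the Q-table, the check is consistent with the abstract's claim that no decomposability assumption on the joint value function is required. It also suggests why a sequential scheme is natural: each agent's term depends only on the actions of the agents visited before it.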

Authors

Yaodong Yang

Jun Wang

J. Kuba
