Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

Published in

Neural Information Processing Systems (2022)

TL;DR

An Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems, is proposed.

Abstract

We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-off. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the designed algorithms based on the ARNPG framework achieve $\tilde{O}(1/T)$ global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both the exact-gradient and sample-based settings.
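The three criteria named in the abstract can be illustrated on a toy example. The sketch below is not taken from the paper: the candidate policies, their value vectors, and the threshold are hypothetical, and the scoring functions are the standard textbook forms (sum of logarithms for proportional fairness, a thresholded first objective for the constrained MDP, and the smallest component for max-min). It only shows that the three criteria can prefer different policies.

```python
import math

def proportional_fairness(values):
    """Smooth concave scalarization: sum of log-values (values must be positive)."""
    return sum(math.log(v) for v in values)

def constrained_objective(values, thresholds):
    """Constrained-MDP view: score is values[0] if every remaining value
    meets its threshold, and -inf (infeasible) otherwise."""
    feasible = all(v >= b for v, b in zip(values[1:], thresholds))
    return values[0] if feasible else -math.inf

def max_min(values):
    """Max-min trade-off: a policy is scored by its worst objective."""
    return min(values)

# Hypothetical value vectors (V_1, V_2) for four candidate policies.
candidates = {
    "A": (9.0, 1.0),
    "B": (4.0, 4.0),
    "C": (7.0, 2.0),
    "D": (5.5, 3.5),
}

def best(score):
    """Return the name of the candidate that maximizes the given score."""
    return max(candidates, key=lambda name: score(candidates[name]))

best_pf = best(proportional_fairness)
# Hypothetical hard constraint: V_2 >= 2.0.
best_cmdp = best(lambda v: constrained_objective(v, thresholds=(2.0,)))
best_maxmin = best(max_min)
print(best_pf, best_cmdp, best_maxmin)
```

In this toy setting the criteria select different candidates, which is why a framework covering all three has to handle each scalarization explicitly rather than rely on a single fixed weighting of the rewards.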

Authors

Ruida Zhou

Tao-Wen Liu

D. Kalathil

References (49 items)

Towards Painless Policy Optimization for Constrained MDPs

Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization

Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Joint Optimization of Concave Scalarized Multi-Objective Reinforcement Learning with Policy Gradient Based Algorithm

Beyond Cumulative Returns via Reinforcement Learning over State-Action Occupancy Measures

On the Linear Convergence of Natural Policy Gradient Algorithm

Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes

Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

Provable Multi-Objective Reinforcement Learning with Generative Models

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

First Order Constrained Optimization in Policy Space

Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

A Multi-Objective Approach to Mitigate Negative Side Effects

Linear Last-iterate Convergence in Constrained Saddle-point Optimization

On the Global Convergence Rates of Softmax Policy Gradient Methods

Reinforcement Learning for Joint Optimization of Multiple Rewards

A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes

Online Primal-Dual Mirror Descent under Stochastic Constraints

A Primal-Dual Parallel Method with $O(1/\epsilon)$ Convergence for Constrained Composite Convex Programs

OpenAI Gym

A Simple Parallel Algorithm with an $O(1/t)$ Convergence Rate for General Convex Programs

Multi-Objective MDPs with Conditional Lexicographic Reward Preferences

Optimization, Learning, and Games with Predictable Sequences

A Survey of Multi-Objective Sequential Decision-Making

Scalarized multi-objective reinforcement learning: Novel design techniques

MuJoCo: A physics engine for model-based control

Mirror descent and nonlinear projected subgradient methods for convex optimization

Constrained Markov Decision Processes

Rate control for communication networks: shadow prices, proportional fairness and stability

Pareto Policy Adaptation

Fast Global Convergence of Policy Optimization for Constrained MDPs

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

First-order and Stochastic Optimization Methods for Machine Learning

Prox-Method with Rate of Convergence $O(1/t)$ for Variational Inequalities with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle Point Problems

Data Networks

Field of Study

Computer Science, Mathematics

Journal Information

Name

ArXiv

Volume

abs/2005.00687

Venue Information

Name

Neural Information Processing Systems

Type

conference

URL

http://neurips.cc/

Alternate Names

Neural Inf Process Syst
NeurIPS
NIPS
