[1] Hopular: Modern Hopfield Networks for Tabular Data
[2] History Compression via Language Models in Reinforcement Learning
[3] A Globally Convergent Evolutionary Strategy for Stochastic Constrained Optimization with Applications to Reinforcement Learning
[4] CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP
[5] Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER
[6] Cross-Domain Few-Shot Learning by Representation Fusion
[7] Forgetful Experience Replay in Hierarchical Reinforcement Learning from Demonstrations
[8] Playing Minecraft with Behavioural Cloning
[9] Sample Efficient Reinforcement Learning through Learning from Demonstrations in Minecraft
[10] Retrospective Analysis of the 2019 MineRL Competition on Sample Efficient Reinforcement Learning
[11] Hierarchical Deep Q-Network with Forgetting from Imperfect Demonstrations in Minecraft
[12] PyTorch: An Imperative Style, High-Performance Deep Learning Library
[13] Continuous Deep Maximum Entropy Inverse Reinforcement Learning using online POMDP
[14] Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance
[15] MineRL: A Large-Scale Dataset of Minecraft Demonstrations
[16] Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
[17] Watch, Try, Learn: Meta-Learning from Demonstrations and Reward
[18] SQIL: Imitation Learning via Regularized Behavioral Cloning
[19] Successor Options: An Option Discovery Framework for Reinforcement Learning
[20] The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors
[21] Go-Explore: a New Approach for Hard-Exploration Problems
[22] Experience Replay for Continual Learning
[23] Inverse reinforcement learning for video games
[24] RUDDER: Return Decomposition for Delayed Rewards
[25] Hierarchical Imitation and Reinforcement Learning
[26] Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning
[27] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
[28] Pretraining Deep Actor-Critic Reinforcement Learning Algorithms With Expert Demonstrations
[29] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
[30] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
[31] Eigenoption Discovery through the Deep Successor Representation
[32] Meta Learning Shared Hierarchies
[33] Rainbow: Combining Improvements in Deep Reinforcement Learning
[34] Overcoming Exploration in Reinforcement Learning with Demonstrations
[35] Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
[36] One-Shot Visual Imitation Learning via Meta-Learning
[37] Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards
[38] Proximal Policy Optimization Algorithms
[39] Deep Q-learning From Demonstrations
[40] Learning from Demonstrations for Real World Reinforcement Learning
[41] One-Shot Imitation Learning
[42] Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction
[43] FeUdal Networks for Hierarchical Reinforcement Learning
[44] Time delays, competitive interdependence, and firm performance
[45] The Option-Critic Architecture
[46] Probabilistic inference for determining options in reinforcement learning
[47] Playing Atari Games with Deep Reinforcement Learning and Human Checkpoint Replay
[48] Successor Features for Transfer in Reinforcement Learning
[49] Generative Adversarial Imitation Learning
[50] Exploration from Demonstration for Interactive Reinforcement Learning
[51] Learning from Demonstration for Shaping through Inverse Reinforcement Learning
[52] Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation
[53] Adaptive Skills Adaptive Partitions (ASAP)
[54] Mastering the game of Go with deep neural networks and tree search
[55] Reinforcement Learning from Demonstration through Shaping
[57] Learning from Limited Demonstrations
[58] Compositional Planning Using Optimal Option Models
[59] Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
[60] Unified Inter and Intra Options Learning Using Policy Gradient Methods
[61] Integrating reinforcement learning with human demonstrations of varying ability
[62] Robot Programming by Demonstration
[63] A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
[64] Optimal policy switching algorithms for reinforcement learning
[65] Efficient Reductions for Imitation Learning
[66] Effects of feedback delay on learning
[67] Active Learning for Reward Estimation in Inverse Reinforcement Learning
[68] Robot Programming by Demonstration
[69] Search-based structured prediction
[70] Biopython: freely available Python tools for computational molecular biology and bioinformatics
[71] Sequence Comparison: Theory and Methods
[72] Maximum Entropy Inverse Reinforcement Learning
[73] A Game-Theoretic Approach to Apprenticeship Learning
[74] Exact finite approximations of average-cost countable Markov decision processes
[75] Large-scale kernel machines
[76] Matplotlib: A 2D Graphics Environment
[77] Clustering by Passing Messages Between Data Points
[78] Apprenticeship learning via inverse reinforcement learning
[79] MUSCLE: multiple sequence alignment with high accuracy and high throughput
[80] Learning Options in Reinforcement Learning
[81] Approximately Optimal Approximate Reinforcement Learning
[82] T-Coffee: A novel method for fast and accurate multiple sequence alignment
[83] Algorithms for Inverse Reinforcement Learning
[84] Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning
[85] Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition
[86] Highly specific protein sequence motifs for genome analysis
[88] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
[89] Learning from Demonstration
[90] LSTM can Solve Hard Long Time Lag Problems
[91] Learning to Take Actions
[92] Learning from delayed rewards
[93] On the Complexity of Multiple Sequence Alignment
[94] CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
[95] Building symbolic representations of intuitive real-time skills from performance data
[96] PROSITE: recent developments
[97] Markov Decision Processes: Discrete Stochastic Dynamic Programming
[98] Improving Generalization for Temporal Difference Learning: The Successor Representation
[99] Amino acid substitution matrices from protein blocks
[100] Efficient Training of Artificial Neural Networks for Autonomous Navigation
[101] Basic local alignment search tool
[102] Statistical Composition of High-Scoring Segments from Molecular Sequences
[103] Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes
[104] Multiple sequence alignment with hierarchical clustering
[105] An improved algorithm for matching biological sequences
[106] Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli
[107] Identification of common molecular subsequences
[108] Cases in which Parsimony or Compatibility Methods will be Positively Misleading
[109] A linear space algorithm for computing maximal common subsequences
[110] Understanding the Effects of Dataset Characteristics on Offline Reinforcement Learning
[111] Hopfield Networks is All You Need
[112] XAI and Strategy Extraction via Reward Redistribution
[113] Align-RUDDER: Learning from Few Demonstrations by Reward Redistribution
[114] Modern Hopfield Networks and Attention for Immune Repertoire Classification
[116] mazelab: A customizable framework to create maze and gridworld environments
[118] Active Imitation Learning: Formal and Practical Reductions to I.I.D. Learning
[120] Reinforcement Learning: An Introduction
[121] DIALIGN: multiple DNA and protein sequence alignment at BiBiServ
[122] Untersuchungen zu dynamischen neuronalen Netzen
[123] Cognitive models from subcognitive skills
[124] Atlas of protein sequence and structure
[125] Few-shot learning by dimensionality reduction in gradient space