Safe Exploration Algorithms for Reinforcement Learning Controllers

Cited by: 74
Authors
Mannucci, Tommaso [1]
van Kampen, Erik-Jan [1]
de Visser, Cornelis [1]
Chu, Qiping [1]
Affiliations
[1] Delft Univ Technol, Fac Aerosp Engn, Control & Simulat Div, NL-2629 HS Delft, Netherlands
Keywords
Adaptive controllers; model-free control; reinforcement learning (RL); safe exploration; DISCRETE-TIME; NONLINEAR-SYSTEMS; CONSTRAINTS; NETWORKS; LYAPUNOV; DESIGN; ROBOTS
DOI
10.1109/TNNLS.2017.2654539
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Self-learning approaches, such as reinforcement learning, offer new possibilities for the autonomous control of uncertain or time-varying systems. However, exploring an unknown environment under limited prediction capabilities is a challenge for a learning agent, and if the environment is dangerous, free exploration can result in physical damage or otherwise unacceptable behavior. The main contribution of this paper, with respect to existing methods, is a new approach that requires neither global safety functions nor specific formulations of the dynamics or of the environment; instead, it relies on interval estimation of the agent's dynamics during the exploration phase, assuming the agent has a limited capability to perceive incoming fatal states. Two algorithms based on this approach are presented. The first is the Safety Handling Exploration with Risk Perception Algorithm (SHERPA), which provides safety by identifying temporary safety functions, called backups. SHERPA is demonstrated on a simulated, simplified quadrotor task, in which dangerous states are avoided. The second algorithm, OptiSHERPA, uses safety metrics to safely handle dynamically more complex systems for which SHERPA is not sufficient. OptiSHERPA is demonstrated on a simulated aircraft altitude control task.
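The abstract only outlines the mechanism, but the core idea admits a compact illustration: vet each exploratory action with an interval model of the dynamics, and accept it only if a backup action sequence provably keeps the reachable set inside a safe region. The Python sketch below is a loose illustration under assumptions not taken from the paper: a linear interval model x_{k+1} ∈ [A]x_k + [B]u_k, a box-shaped safe set, and a fixed candidate backup sequence standing in for SHERPA's backup search. All function names are hypothetical, not the paper's API.

```python
import numpy as np

# Hypothetical sketch only; the paper's SHERPA procedure is more involved.
# Assumed (not from the paper): elementwise matrix bounds A_lo <= A <= A_hi,
# B_lo <= B <= B_hi, and a safe set given as the box [safe_lo, safe_hi].

def interval_mv(M_lo, M_hi, v):
    """Elementwise bounds on M @ v for any M with M_lo <= M <= M_hi."""
    v_pos, v_neg = np.maximum(v, 0.0), np.minimum(v, 0.0)
    return M_lo @ v_pos + M_hi @ v_neg, M_hi @ v_pos + M_lo @ v_neg

def interval_mv_box(M_lo, M_hi, v_lo, v_hi):
    """Elementwise bounds on M @ v for M in [M_lo, M_hi], v in [v_lo, v_hi].
    The bilinear terms attain their extrema at the four corners."""
    corners = np.stack([M_lo * v_lo, M_lo * v_hi, M_hi * v_lo, M_hi * v_hi])
    return corners.min(axis=0).sum(axis=1), corners.max(axis=0).sum(axis=1)

def step_box(lo, hi, u, A_lo, A_hi, B_lo, B_hi):
    """Propagate a state box one step through the interval model."""
    ax_lo, ax_hi = interval_mv_box(A_lo, A_hi, lo, hi)
    bu_lo, bu_hi = interval_mv(B_lo, B_hi, u)
    return ax_lo + bu_lo, ax_hi + bu_hi

def inside(lo, hi, safe_lo, safe_hi):
    return np.all(lo >= safe_lo) and np.all(hi <= safe_hi)

def action_is_safe(x, u, backup_seq, A_lo, A_hi, B_lo, B_hi,
                   safe_lo, safe_hi):
    """Accept an exploratory action u only if (i) every state the interval
    model allows after u lies in the safe box, and (ii) the backup sequence
    keeps the whole reachable box inside the safe box at every step."""
    lo, hi = step_box(x, x, u, A_lo, A_hi, B_lo, B_hi)
    if not inside(lo, hi, safe_lo, safe_hi):
        return False
    for u_b in backup_seq:
        lo, hi = step_box(lo, hi, u_b, A_lo, A_hi, B_lo, B_hi)
        if not inside(lo, hi, safe_lo, safe_hi):
            return False
    return True

# Example: double-integrator-like model with +/-5% parameter uncertainty.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
safe = action_is_safe(
    x=np.array([0.5, 0.0]), u=np.array([1.0]),
    backup_seq=[np.array([-1.0])] * 5,
    A_lo=0.95 * A, A_hi=1.05 * A, B_lo=0.95 * B, B_hi=1.05 * B,
    safe_lo=np.array([-1.0, -1.0]), safe_hi=np.array([1.0, 1.0]))
```

Propagating the entire reachable box at every backup step is conservative, but that conservatism is what keeps the check sound under model uncertainty, which matches the abstract's emphasis on avoiding fatal states rather than on exploration efficiency.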
Pages: 1069-1081
Page count: 13