An Actor-Critic Framework for Online Control With Environment Stability Guarantee

Cited by: 0
Authors
Osinenko, Pavel [1 ]
Yaremenko, Grigory [1 ]
Malaniya, Georgiy [1 ]
Bolychev, Anton [1 ]
Affiliations
[1] Skolkovo Inst Sci & Technol, Moscow 121205, Russia
Keywords
Reinforcement learning; Predictive control; Stability analysis; Lyapunov methods; Safety; Costs; Training; Control; Stabilization; Lyapunov function; Model-predictive control; Continuous-time; Nonlinear systems; Reinforcement; MPC; Optimization; Algorithm; Networks; Theorem; Safe
DOI
10.1109/ACCESS.2023.3306070
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Discipline Code
0812
Abstract
Online actor-critic reinforcement learning trains an agent on the fly through dynamic interaction with its environment. In such applications, long pre-training of the kind commonly performed in offline, tabular, or Monte-Carlo settings is generally not possible. These applications arise more often in industry than in purely digital domains such as cloud services, video games, or database management, where reinforcement learning has already demonstrated success. Stability of the closed loop formed by the agent and the environment is a major challenge here, not only for the safety and integrity of the environment, but also for sparing resources otherwise wasted on failed training episodes. In this paper, we address environment stability under an actor-critic reinforcement learning agent by integrating tools from Lyapunov stability theory. Under the presented approach, closed-loop stability is secured in every episode without pre-training. In a case study with a mobile robot, the suggested agent always achieved the control goal while significantly reducing the cost. Although many approaches can be applied to mobile robot control, the experiments indicate the promising potential of actor-critic reinforcement learning agents based on Lyapunov-like constraints. The presented methodology may be applied in safety-critical industrial settings where stability is a necessity.
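
The abstract refers to securing closed-loop stability through Lyapunov-like constraints on the learning agent. As a rough illustration of that idea only, the sketch below shows a run-time filter that accepts the actor's proposed action when a candidate quadratic Lyapunov function is predicted to decay, and otherwise falls back to a known stabilizing controller. The linear model, the gain K, the matrix P, the decay rate alpha, and all function names are assumptions made for this example and are not taken from the paper.

# Illustrative sketch only (not the authors' algorithm): accept the actor's
# action if V(x) = x'Px decreases along the predicted next state, else apply
# the stabilizing fallback u = Kx.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])        # discrete-time double integrator (assumed model)
B = np.array([[0.5 * dt**2], [dt]])
K = np.array([[-3.2, -2.5]])                 # stabilizing fallback gain (assumed known)
A_cl = A + B @ K
P = solve_discrete_lyapunov(A_cl.T, np.eye(2))   # Lyapunov matrix for the fallback loop
alpha = 0.001                                    # required relative decay of V per step

def V(x):
    return float(x.T @ P @ x)

def lyapunov_filter(x, u_actor):
    """Return the actor's action if it satisfies the Lyapunov decay condition,
    otherwise the fallback action Kx."""
    x_next = A @ x + B * u_actor
    if V(x_next) <= (1.0 - alpha) * V(x):
        return u_actor
    return float(K @ x)

# One interaction step with an (untrained) actor proposal:
rng = np.random.default_rng(0)
x = np.array([[1.0], [0.0]])
u = lyapunov_filter(x, float(rng.uniform(-1.0, 1.0)))
x = A @ x + B * u

In the paper's setting, the predicted decrease would be enforced as a constraint inside the agent's optimization rather than by a hard fallback; the sketch above only conveys the run-time role of the Lyapunov condition.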
Pages: 89188-89204
Number of pages: 17