Safe Q-Learning Method Based on Constrained Markov Decision Processes

Cited by: 19
Authors
Ge, Yangyang [1]
Zhu, Fei [1,2]
Lin, Xinghong [1]
Liu, Quan [1]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Soochow Univ, Prov Key Lab Comp Informat Proc Technol, Suzhou 215006, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Funding
National Natural Science Foundation of China
Keywords
Constrained Markov decision processes; safe reinforcement learning; Q-learning; constraint; Lagrange multiplier; REINFORCEMENT; OPTIMIZATION; ALGORITHM;
DOI
10.1109/ACCESS.2019.2952651
Chinese Library Classification (CLC)
TP [automation technology, computer technology];
Discipline code
0812;
Abstract
The application of reinforcement learning in industrial settings has made agent safety a research hotspot. Traditional methods address the safety problem mainly by altering the objective function or the agent's exploration process. However, because most of these methods ignore the damage caused by unsafe states, they can hardly prevent the agent from falling into dangerous states, and the resulting solutions are often unsatisfactory. To solve this problem, we propose a safe Q-learning method based on constrained Markov decision processes (CMDPs), which adds safety constraints to the model as prerequisites and improves the standard Q-learning algorithm so that it seeks the optimal solution under the premise that safety is satisfied. While solving for the optimal state-action value, the agent's feasible space is restricted to the safe space by filtering the action space with the added constraints. Traditional solution methods are not applicable to the safe Q-learning model because they tend to yield only locally optimal solutions; instead, after linearizing the constraint functions, we use the Lagrange multiplier method to solve for the optimal action that can be performed in the current state. This not only improves the efficiency and accuracy of the algorithm but also guarantees a globally optimal solution. Experiments verify the effectiveness of the algorithm.
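To give a rough sense of the constraint-filtered action selection the abstract describes, the following is a minimal sketch, not the authors' implementation. It assumes a tabular environment with integer states exposing `reset()`/`step(a)`, plus a hypothetical per-step safety-cost function `cost_fn(s, a)` and a scalar `safety_budget`; the paper's actual solver (the Lagrange multiplier method over linearized constraints) is replaced here by a direct feasibility filter on the action space.

```python
import numpy as np

def safe_q_learning(env, n_states, n_actions, cost_fn, safety_budget,
                    episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning whose action choice is restricted to the safe set.

    Illustrative sketch only: `cost_fn` and `safety_budget` are assumed
    names, and the environment interface (reset/step) is hypothetical.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Filter the action space by the constraint: only actions whose
            # immediate safety cost stays within the budget remain feasible.
            feasible = [a for a in range(n_actions)
                        if cost_fn(s, a) <= safety_budget]
            if not feasible:
                # Fallback when no action satisfies the constraint:
                # take the least-unsafe action.
                feasible = [min(range(n_actions),
                                key=lambda a: cost_fn(s, a))]
            # Epsilon-greedy exploration, restricted to the safe set
            if np.random.rand() < epsilon:
                a = int(np.random.choice(feasible))
            else:
                a = max(feasible, key=lambda a_: q[s, a_])
            s2, r, done = env.step(a)
            # Standard Q-learning backup on the observed transition
            target = r + gamma * (0.0 if done else float(np.max(q[s2])))
            q[s, a] += alpha * (target - q[s, a])
            s = s2
    return q
```

In the paper itself, the optimal safe action is obtained by solving the constrained optimization with the Lagrange multiplier method after linearizing the constraint functions, which is what yields the global-optimality guarantee; the greedy filter above is a simpler stand-in for that step.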
Pages: 165007 - 165017
Page count: 11
Related papers
50 records in total
  • [11] SEM: Safe exploration mask for q-learning
    Xuan, Chengbin
    Zhang, Feng
    Lam, Hak-Keung
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 111
  • [12] Risk-Constrained Markov Decision Processes
    Borkar, Vivek
    Jain, Rahul
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2014, 59 (09) : 2574 - 2579
  • [13] Safe Reinforcement Learning for Arm Manipulation with Constrained Markov Decision Process
    Adjei, Patrick
    Tasfi, Norman
    Gomez-Rosero, Santiago
    Capretz, Miriam A. M.
    ROBOTICS, 2024, 13 (04)
  • [14] Exploiting the structural properties of the underlying Markov decision problem in the Q-learning algorithm
    Kunnumkal, Sumit
    Topaloglu, Huseyin
    INFORMS JOURNAL ON COMPUTING, 2008, 20 (02) : 288 - 301
  • [15] Potential based optimization algorithm of constrained Markov decision processes
    Li Yanjie
    Yin Baoqun
    Xi Hongsheng
    Proceedings of the 24th Chinese Control Conference, Vols 1 and 2, 2005 : 433 - 436
  • [16] Constrained Q-Learning for Batch Process Optimization
    Pan, Elton
    Petsagkourakis, Panagiotis
    Mowbray, Max
    Zhang, Dongda
    del Rio-Chanona, Antonio
    IFAC PAPERSONLINE, 2021, 54 (03) : 492 - 497
  • [17] Model based path planning using Q-Learning
    Sharma, Avinash
    Gupta, Kanika
    Kumar, Anirudha
    Sharma, Aishwarya
    Kumar, Rajesh
    2017 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL TECHNOLOGY (ICIT), 2017 : 837 - 842
  • [18] Dominance-constrained Markov decision processes
    Haskell, William B.
    Jain, Rahul
    2012 IEEE 51ST ANNUAL CONFERENCE ON DECISION AND CONTROL (CDC), 2012 : 5991 - 5996
  • [19] Constrained Markov decision processes with uncertain costs
    Varagapriya, V.
    Singh, Vikas Vikram
    Lisser, Abdel
    OPERATIONS RESEARCH LETTERS, 2022, 50 (02) : 218 - 223
  • [20] Markov decision processes with constrained stopping times
    Horiguchi, M
    Kurano, M
    Yasuda, M
    PROCEEDINGS OF THE 39TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-5, 2000 : 706 - 710