Efficient Off-Policy Safe Reinforcement Learning Using Trust Region Conditional Value at Risk

Cited by: 9
Authors
Kim, Dohyeong [1 ,2 ]
Oh, Songhwai [1 ,2 ]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, ASRI, Seoul 08826, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Reinforcement learning; robot safety; collision avoidance;
DOI
10.1109/LRA.2022.3184793
Chinese Library Classification (CLC)
TP24 [Robotics];
Subject classification codes
080202; 1405;
Abstract
This letter aims to solve a safe reinforcement learning (RL) problem with risk measure-based constraints. Because risk measures such as conditional value at risk (CVaR) focus on the tail of the cost distribution, constraining risk measures can effectively prevent failures in the worst case. An on-policy safe RL method, called TRC, handles the CVaR-constrained RL problem using a trust region method and can generate policies with almost zero constraint violations while achieving high returns. However, to achieve strong performance in complex environments and satisfy safety constraints quickly, RL methods must be sample efficient. To this end, we propose an off-policy safe RL method with CVaR constraints, called off-policy TRC. If off-policy data from replay buffers is used directly to train TRC, the estimation error caused by the distributional shift degrades performance. To resolve this issue, we propose novel surrogate functions that reduce the effect of the distributional shift, and we introduce an adaptive trust-region constraint that keeps the policy from deviating far from the distribution of the replay buffer. The proposed method has been evaluated in simulation and real-world environments, where it satisfied the safety constraints within a few steps while achieving high returns even in complex robotic tasks.
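As a rough illustration of the CVaR constraint discussed in the abstract, the sketch below estimates the empirical CVaR of episode cost returns drawn from a replay buffer and compares it against a cost limit. This is a minimal sketch under stated assumptions, not the paper's implementation: the names empirical_cvar, cost_limit, and alpha, as well as the synthetic cost data, are illustrative only.

```python
import numpy as np

def empirical_cvar(cost_returns, alpha=0.95):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of cost returns.

    cost_returns: 1-D array of discounted cost sums, one per sampled episode.
    """
    costs = np.sort(np.asarray(cost_returns))
    var = np.quantile(costs, alpha)       # value at risk (alpha-quantile of costs)
    tail = costs[costs >= var]            # tail distribution beyond the VaR level
    return tail.mean()

# Illustrative constraint check: a CVaR-constrained method keeps the policy
# update feasible only while the estimated tail cost stays below a limit.
rng = np.random.default_rng(0)
sampled_costs = rng.exponential(scale=5.0, size=1000)  # stand-in replay-buffer costs
cvar = empirical_cvar(sampled_costs, alpha=0.95)
cost_limit = 25.0                                      # illustrative threshold
print(f"CVaR_0.95 = {cvar:.2f}, constraint satisfied: {cvar <= cost_limit}")
```

In a CVaR-constrained formulation it is this tail expectation, rather than the mean cost, that the policy update must keep below the threshold, which is why constraining CVaR is more conservative than constraining the expected cost.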
Pages: 7644-7651
Number of pages: 8
Related papers
50 records in total
  • [1] TRC: Trust Region Conditional Value at Risk for Safe Reinforcement Learning
    Kim, Dohyeong
    Oh, Songhwai
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (02) : 2621 - 2628
  • [2] Safe Off-policy Reinforcement Learning Using Barrier Functions
    Marvi, Zahra
    Kiumarsi, Bahare
    2020 AMERICAN CONTROL CONFERENCE (ACC), 2020 : 2176 - 2181
  • [3] An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning
    Meng, Wenjia
    Zheng, Qian
    Shi, Yue
    Pan, Gang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (05) : 2223 - 2235
  • [4] Off-Policy Deep Reinforcement Learning Based on Steffensen Value Iteration
    Cheng, Yuhu
    Chen, Lin
    Chen, C. L. Philip
    Wang, Xuesong
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2021, 13 (04) : 1023 - 1032
  • [5] Reliable Off-Policy Evaluation for Reinforcement Learning
    Wang, Jie
    Gao, Rui
    Zha, Hongyuan
    OPERATIONS RESEARCH, 2024, 72 (02) : 699 - 716
  • [6] Sequential Search with Off-Policy Reinforcement Learning
    Miao, Dadong
    Wang, Yanan
    Tang, Guoyu
    Liu, Lin
    Xu, Sulong
    Long, Bo
    Xiao, Yun
    Wu, Lingfei
    Jiang, Yunjiang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021 : 4006 - 4015
  • [7] A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning
    Patterson, Andrew
    White, Adam
    White, Martha
    JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23
  • [8] Off-policy model-based end-to-end safe reinforcement learning
    Kanso, Soha
    Jha, Mayank Shekhar
    Theilliol, Didier
    INTERNATIONAL JOURNAL OF ROBUST AND NONLINEAR CONTROL, 2024, 34 (04) : 2806 - 2831
  • [9] Off-policy and on-policy reinforcement learning with the Tsetlin machine
    Gorji, Saeed Rahimi
    Granmo, Ole-Christoffer
    APPLIED INTELLIGENCE, 2023, 53 : 8596 - 8613
  • [10] Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient
    Tosatto, Samuele
    Carvalho, Joao
    Peters, Jan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (10) : 5996 - 6010