Efficient Off-Policy Safe Reinforcement Learning Using Trust Region Conditional Value at Risk

Cited by: 9
Authors
Kim, Dohyeong [1 ,2 ]
Oh, Songhwai [1 ,2 ]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, ASRI, Seoul 08826, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Reinforcement learning; robot safety; collision avoidance;
DOI
10.1109/LRA.2022.3184793
Chinese Library Classification (CLC)
TP24 [Robotics];
Subject classification codes
080202; 1405;
Abstract
This letter aims to solve a safe reinforcement learning (RL) problem with risk measure-based constraints. Because risk measures such as conditional value at risk (CVaR) focus on the tail of the cost distribution, constraining risk measures can effectively prevent failures in the worst case. An on-policy safe RL method, called TRC, handles the CVaR-constrained RL problem using a trust region method and can generate policies with almost zero constraint violations while achieving high returns. However, to achieve strong performance in complex environments and satisfy safety constraints quickly, RL methods must be sample efficient. To this end, we propose an off-policy safe RL method with CVaR constraints, called off-policy TRC. If off-policy data from replay buffers is used directly to train TRC, the estimation error caused by the distributional shift degrades performance. To resolve this issue, we propose novel surrogate functions that reduce the effect of the distributional shift, and we introduce an adaptive trust-region constraint that keeps the policy from deviating far from the distribution of the replay buffer. The proposed method has been evaluated in simulation and real-world environments, where it satisfied the safety constraints within a few steps while achieving high returns even in complex robotic tasks.
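As a rough illustration of the CVaR constraint discussed in the abstract, the sketch below estimates the empirical CVaR of episode cost returns drawn from a replay buffer and compares it against a cost limit. This is a minimal sketch under stated assumptions, not the paper's implementation: the names empirical_cvar, cost_limit, and alpha, as well as the synthetic cost data, are illustrative only.

```python
import numpy as np

def empirical_cvar(cost_returns, alpha=0.95):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of cost returns.

    cost_returns: 1-D array of discounted cost sums, one per sampled episode.
    """
    costs = np.sort(np.asarray(cost_returns))
    var = np.quantile(costs, alpha)       # value at risk (alpha-quantile of costs)
    tail = costs[costs >= var]            # tail distribution beyond the VaR level
    return tail.mean()

# Illustrative constraint check: a CVaR-constrained method keeps the policy
# update feasible only while the estimated tail cost stays below a limit.
rng = np.random.default_rng(0)
sampled_costs = rng.exponential(scale=5.0, size=1000)  # stand-in replay-buffer costs
cvar = empirical_cvar(sampled_costs, alpha=0.95)
cost_limit = 25.0                                      # illustrative threshold
print(f"CVaR_0.95 = {cvar:.2f}, constraint satisfied: {cvar <= cost_limit}")
```

In a CVaR-constrained formulation it is this tail expectation, rather than the mean cost, that the policy update must keep below the threshold, which is why constraining CVaR is more conservative than constraining the expected cost.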
Pages: 7644-7651
Number of pages: 8
Related papers
50 records in total
  • [1] TRC: Trust Region Conditional Value at Risk for Safe Reinforcement Learning
    Kim, Dohyeong
    Oh, Songhwai
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (02) : 2621 - 2628
  • [2] Safe Off-policy Reinforcement Learning Using Barrier Functions
    Marvi, Zahra
    Kiumarsi, Bahare
    2020 AMERICAN CONTROL CONFERENCE (ACC), 2020 : 2176 - 2181
  • [3] An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning
    Meng, Wenjia
    Zheng, Qian
    Shi, Yue
    Pan, Gang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (05) : 2223 - 2235
  • [4] Off-Policy Deep Reinforcement Learning Based on Steffensen Value Iteration
    Cheng, Yuhu
    Chen, Lin
    Chen, C. L. Philip
    Wang, Xuesong
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2021, 13 (04) : 1023 - 1032
  • [5] Reliable Off-Policy Evaluation for Reinforcement Learning
    Wang, Jie
    Gao, Rui
    Zha, Hongyuan
    OPERATIONS RESEARCH, 2024, 72 (02) : 699 - 716
  • [6] Sequential Search with Off-Policy Reinforcement Learning
    Miao, Dadong
    Wang, Yanan
    Tang, Guoyu
    Liu, Lin
    Xu, Sulong
    Long, Bo
    Xiao, Yun
    Wu, Lingfei
    Jiang, Yunjiang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021 : 4006 - 4015
  • [7] A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning
    Patterson, Andrew
    White, Adam
    White, Martha
    JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23
  • [8] Off-policy model-based end-to-end safe reinforcement learning
    Kanso, Soha
    Jha, Mayank Shekhar
    Theilliol, Didier
    INTERNATIONAL JOURNAL OF ROBUST AND NONLINEAR CONTROL, 2024, 34 (04) : 2806 - 2831
  • [9] Off-policy and on-policy reinforcement learning with the Tsetlin machine
    Gorji, Saeed Rahimi
    Granmo, Ole-Christoffer
    APPLIED INTELLIGENCE, 2023, 53 : 8596 - 8613
  • [10] Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient
    Tosatto, Samuele
    Carvalho, Joao
    Peters, Jan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (10) : 5996 - 6010