An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited by: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO;
DOI
10.1109/TNNLS.2020.3044196
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Number
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective in stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, off-policy TRPO, which exploits both on- and off-policy data and guarantees the monotonic improvement of policies. A surrogate objective function is developed that uses both on- and off-policy data while preserving monotonic policy improvement. We then optimize this surrogate objective function by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO achieves better performance on the majority of continuous control tasks than other trust region policy-based methods using off-policy data.
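The two ingredients the abstract combines can be illustrated with a minimal sketch, not the paper's actual algorithm: an importance-sampled surrogate objective that reweights off-policy samples by the probability ratio between the current and the behavior policy, and a trust-region test that accepts a candidate policy only if its mean KL divergence from the old policy stays under a threshold. All numbers below are hypothetical toy data.

```python
import numpy as np

# Toy batch: probability the current policy (pi_new) and the behavior policy
# (pi_behavior) assign to each sampled action, plus advantage estimates.
pi_new = np.array([0.5, 0.4, 0.7, 0.2])
pi_behavior = np.array([0.4, 0.5, 0.6, 0.3])
advantages = np.array([1.0, -0.5, 0.8, 0.2])

# Importance-sampled surrogate objective: E[(pi_new / pi_behavior) * A].
# The ratio corrects for the mismatch between the data-collecting policy
# and the policy being evaluated, which is what lets off-policy data be used.
ratio = pi_new / pi_behavior
surrogate = np.mean(ratio * advantages)

# Trust-region check: accept the candidate policy only if the mean KL
# divergence from the old policy is below a step-size limit delta.
delta = 0.01
p_old = np.array([[0.3, 0.7], [0.6, 0.4]])   # old action distributions per state
p_new = np.array([[0.32, 0.68], [0.58, 0.42]])  # candidate distributions
kl = np.mean(np.sum(p_old * np.log(p_old / p_new), axis=1))
accept = kl <= delta
```

In a full TRPO-style method the surrogate is maximized subject to the KL constraint (typically via a conjugate-gradient step with line search); the sketch above only shows the quantities that constraint and objective are built from.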
Pages: 2223-2235 (13 pages)
Related Papers (50 total)
  • [31] Off-policy evaluation for tabular reinforcement learning with synthetic trajectories
    Wang, Weiwei
    Li, Yuqiang
    Wu, Xianyi
    STATISTICS AND COMPUTING, 2024, 34 (01)
  • [32] Off-Policy Conservative Distributional Reinforcement Learning With Safety Constraints
    Zhang, Hengrui
    Lin, Youfang
    Han, Sheng
    Wang, Shuo
    Lv, Kai
    IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2025, 55 (03): : 2033 - 2045
  • [33] Policy Return: A New Method for Reducing the Number of Experimental Trials in Deep Reinforcement Learning
    Liu, Feng
    Dai, Shuling
    Zhao, Yongjia
    IEEE ACCESS, 2020, 8 : 228099 - 228107
  • [34] Traffic Signal Control Using End-to-End Off-Policy Deep Reinforcement Learning
    Chu, Kai-Fung
    Lam, Albert Y. S.
    Li, Victor O. K.
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2022, 23 (07) : 7184 - 7195
  • [35] Safe Off-Policy Deep Reinforcement Learning Algorithm for Volt-VAR Control in Power Distribution Systems
    Wang, Wei
    Yu, Nanpeng
    Gao, Yuanqi
    Shi, Jie
    IEEE TRANSACTIONS ON SMART GRID, 2020, 11 (04) : 3008 - 3018
  • [36] Re-attentive experience replay in off-policy reinforcement learning
    Wei, Wei
    Wang, Da
    Li, Lin
    Liang, Jiye
    MACHINE LEARNING, 2024, 113 (05) : 2327 - 2349
  • [37] Enhanced Strategies for Off-Policy Reinforcement Learning Algorithms in HVAC Control
    Chen, Zhe
    Jia, Qingshan
    2024 14TH ASIAN CONTROL CONFERENCE, ASCC 2024, 2024, : 1691 - 1696
  • [39] Model-free off-policy reinforcement learning in continuous environment
    Wawrzynski, P
    Pacut, A
    2004 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-4, PROCEEDINGS, 2004, : 1091 - 1096
  • [40] Cautious policy programming: exploiting KL regularization for monotonic policy improvement in reinforcement learning
    Lingwei Zhu
    Takamitsu Matsubara
    Machine Learning, 2023, 112 : 4527 - 4562