An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited by: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO;
DOI
10.1109/TNNLS.2020.3044196
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective at stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, off-policy TRPO, which exploits both on- and off-policy data and guarantees monotonic policy improvement. A surrogate objective function is developed that uses both on- and off-policy data while preserving the monotonic improvement guarantee. We then optimize this surrogate objective by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO achieves better performance on the majority of continuous control tasks than other trust-region policy-based methods that use off-policy data.
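As a minimal illustration of the trust-region machinery the abstract refers to (a generic sketch of standard TRPO ingredients, not the authors' off-policy implementation), the surrogate objective reweights advantages by an importance-sampling ratio between the new and old policies, and a mean KL-divergence bound defines the trust region:

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, advantages):
    """Importance-weighted surrogate: L = E[ pi_new/pi_old * A ]."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(ratio * np.asarray(advantages)))

def mean_kl(p_old, p_new):
    """Mean KL(p_old || p_new) over a batch of discrete action distributions."""
    p_old, p_new = np.asarray(p_old), np.asarray(p_new)
    return float(np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)))

def within_trust_region(p_old, p_new, delta=0.01):
    """Accept a candidate policy only if its mean KL from the old policy is <= delta."""
    return mean_kl(p_old, p_new) <= delta

# Toy check: when the candidate equals the old policy, every ratio is 1,
# so the surrogate reduces to the mean advantage and the KL distance is 0.
logp = np.log(np.array([0.5, 0.25, 0.25]))
adv = np.array([1.0, -0.5, 0.5])
print(surrogate_objective(logp, logp, adv))
```

In full TRPO this constrained maximization is solved approximately (e.g., via a conjugate-gradient step on the Fisher-vector product followed by a line search); the sketch above only shows the two quantities being traded off.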
Pages: 2223 - 2235 (13 pages)
Related Papers (50 records)
  • [11] Zhang, Huihui; Han, Xu. Off-policy asymptotic and adaptive maximum entropy deep reinforcement learning. International Journal of Machine Learning and Cybernetics, 2025, 16(04): 2417-2429.
  • [12] Wang, Jie; Gao, Rui; Zha, Hongyuan. Reliable Off-Policy Evaluation for Reinforcement Learning. Operations Research, 2024, 72(02): 699-716.
  • [13] Miao, Dadong; Wang, Yanan; Tang, Guoyu; Liu, Lin; Xu, Sulong; Long, Bo; Xiao, Yun; Wu, Lingfei; Jiang, Yunjiang. Sequential Search with Off-Policy Reinforcement Learning. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM 2021), 2021: 4006-4015.
  • [14] Tan, Xiaoyu; Qu, Chao; Xiong, Junwu; Zhang, James; Qiu, Xihe; Jin, Yaochu. Model-Based Off-Policy Deep Reinforcement Learning With Model-Embedding. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024, 8(04): 2974-2986.
  • [15] Park, Bumgeun; Kim, Taeyoung; Moon, Woohyeon; Nengroo, Sarvar Hussain; Har, Dongsoo. Off-Policy Reinforcement Learning with Loss Function Weighted by Temporal Difference Error. Advanced Intelligent Computing Technology and Applications (ICIC 2023), Pt V, 2023, 14090: 600-613.
  • [16] Garg, Shaswat; Masnavi, Houman; Fidan, Baris; Janabi-Sharifi, Farrokh; Mantegh, Iraj. Benchmarking Off-Policy Deep Reinforcement Learning Algorithms for UAV Path Planning. 2024 International Conference on Unmanned Aircraft Systems (ICUAS), 2024: 317-323.
  • [17] Gurumurthy, Swaminathan; Kolter, J. Zico; Manchester, Zachary. Deep Off-Policy Iterative Learning Control. Learning for Dynamics and Control Conference, Vol 211, 2023.
  • [18] Lee, Sunbowen; Gong, Yicheng; Deng, Chao. Counterfactual experience augmented off-policy reinforcement learning. Neurocomputing, 2025, 637.
  • [19] Yang, Yana; Xi, Meng; Dai, Huiao; Wen, Jiabao; Yang, Jiachen. Z-Score Experience Replay in Off-Policy Deep Reinforcement Learning. Sensors, 2024, 24(23).
  • [20] Yu, Jiayu; Li, Jingyao; Lu, Shuai; Han, Shuai. Mixed experience sampling for off-policy reinforcement learning. Expert Systems with Applications, 2024, 251.