An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited by: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO;
DOI
10.1109/TNNLS.2020.3044196
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective at stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, off-policy TRPO, which exploits both on- and off-policy data and guarantees the monotonic improvement of policies. A surrogate objective function is developed to use both on- and off-policy data while preserving the monotonic improvement of policies. We then optimize this surrogate objective function by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO achieves better performance on the majority of continuous control tasks than other trust region policy-based methods that use off-policy data.
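The abstract names two ingredients: a surrogate objective that importance-weights both on- and off-policy samples, and a trust-region (KL-constrained) update solved approximately. The PyTorch sketch below illustrates that general pattern only; it is not the authors' implementation. The network architecture, the backtracking line search standing in for TRPO's conjugate-gradient solver, and names such as `obs`, `acts`, `advs`, and `behavior_logp` are assumptions made for illustration.

```python
# Minimal sketch of a trust-region update on an importance-weighted
# surrogate objective, assuming PyTorch. This is NOT the paper's
# off-policy TRPO; it only illustrates the two ingredients named in
# the abstract. `behavior_logp` (log-probs under whichever policy
# generated each sample) and all shapes are hypothetical.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy for continuous control."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mu(obs), self.log_std.exp())


def surrogate_loss(policy, obs, acts, advs, behavior_logp):
    # Importance weight: current policy vs. the (possibly off-policy)
    # behavior policy that generated each sample.
    logp = policy.dist(obs).log_prob(acts).sum(-1)
    ratio = torch.exp(logp - behavior_logp)
    return -(ratio * advs).mean()  # negated so a minimizer ascends


def mean_kl(policy, old_dist, obs):
    # Average KL from the frozen pre-update policy to the current one.
    return torch.distributions.kl_divergence(
        old_dist, policy.dist(obs)).sum(-1).mean()


def trust_region_step(policy, obs, acts, advs, behavior_logp,
                      max_kl=0.01, step_size=1.0, backtracks=10):
    """Gradient step on the surrogate, shrunk until the KL constraint
    holds. (TRPO proper solves the constrained problem with conjugate
    gradient and Fisher-vector products; a backtracking line search
    keeps this sketch short.)"""
    d = policy.dist(obs)
    old_dist = torch.distributions.Normal(d.loc.detach(), d.scale.detach())
    loss = surrogate_loss(policy, obs, acts, advs, behavior_logp)
    grads = torch.autograd.grad(loss, list(policy.parameters()))
    start = [p.detach().clone() for p in policy.parameters()]
    for i in range(backtracks):
        with torch.no_grad():
            for p, p0, g in zip(policy.parameters(), start, grads):
                p.copy_(p0 - step_size * 0.5 ** i * g)
        if mean_kl(policy, old_dist, obs).item() <= max_kl:
            return  # accepted: improvement step inside the trust region
    with torch.no_grad():  # no feasible step found: restore parameters
        for p, p0 in zip(policy.parameters(), start):
            p.copy_(p0)
```

Fed with batches drawn from both fresh rollouts and a replay buffer, with each transition tagged with the log-probability its behavior policy assigned, repeated calls to trust_region_step approximate the on/off-policy mixing the abstract describes.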
Pages: 2223-2235
Number of pages: 13
Related Papers
50 entries in total
  • [41] Cautious policy programming: exploiting KL regularization for monotonic policy improvement in reinforcement learning
    Zhu, Lingwei
    Matsubara, Takamitsu
    MACHINE LEARNING, 2023, 112 (11) : 4527 - 4562
  • [42] TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning
    Shi, Longxiang
    Li, Shijian
    Cao, Longbing
    Yang, Long
    Pan, Gang
    AAMAS '19: PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS, 2019, : 1025 - 1032
  • [43] Fuzzy state aggregation and off-policy reinforcement learning for stochastic environments
    Wardell, Dean C.
    Peterson, Gilbert L.
    PROCEEDINGS OF THE EIGHTH IASTED INTERNATIONAL CONFERENCE ON CONTROL AND APPLICATIONS, 2006, : 133+
  • [44] Off-Policy Meta-Reinforcement Learning With Belief-Based Task Inference
    Imagawa, Takahisa
    Hiraoka, Takuya
    Tsuruoka, Yoshimasa
    IEEE ACCESS, 2022, 10 : 49494 - 49507
  • [45] An off-policy deep reinforcement learning-based active learning for crime scene investigation image classification
    Zhang, Yixin
    Liu, Yang
    Jiang, Guofan
    Yang, Yuchen
    Zhang, Jian
    Jing, Yang
    Alizadehsani, Roohallah
    Tadeusiewicz, Ryszard
    Plawiak, Pawel
    INFORMATION SCIENCES, 2025, 710
  • [46] Off-policy deep reinforcement learning with automatic entropy adjustment for adaptive online grid emergency control
    Zhang, Ying
    Yue, Meng
    Wang, Jianhui
    ELECTRIC POWER SYSTEMS RESEARCH, 2023, 217
  • [47] An Off-policy maximum entropy deep reinforcement learning method for data-driven secondary frequency control of island microgrid
    Huang, Xiangmin
    Zeng, Jun
    Wang, Tianlun
    Zeng, Shunqi
    APPLIED SOFT COMPUTING, 2025, 170
  • [48] High-Value Prioritized Experience Replay for Off-policy Reinforcement Learning
    Cao, Xi
    Wan, Huaiyu
    Lin, Youfang
    Han, Sheng
    2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019), 2019, : 1510 - 1514
  • [49] A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning
    Patterson, Andrew
    White, Adam
    White, Martha
    JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23
  • [50] Hyperparameter Tuning of an Off-Policy Reinforcement Learning Algorithm for H∞ Tracking Control
    Farahmandi, Alireza
    Reitz, Brian
    Debord, Mark
    Philbrick, Douglas
    Estabridis, Katia
    Hewer, Gary
    LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 211, 2023, 211