An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited by: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO;
DOI
10.1109/TNNLS.2020.3044196
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective in stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, termed off-policy TRPO, which exploits both on- and off-policy data and guarantees the monotonic improvement of policies. A surrogate objective function is developed that uses both on- and off-policy data while preserving the monotonic improvement of policies. We then optimize this surrogate objective function by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO outperforms other trust-region policy-based methods that use off-policy data on the majority of continuous control tasks.
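The abstract describes a surrogate objective built from both on- and off-policy samples and optimized under a trust-region constraint. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the authors' implementation: importance ratios correct the off-policy samples, and the exact constrained step from the paper is approximated here by a KL penalty plus a KL-based stopping check. All names (PolicyNet, update_policy, max_kl, penalty_coef) and hyperparameter values are illustrative assumptions.

```python
# Minimal sketch (assumption-laden, not the authors' code) of a trust-region-style
# policy update on a mixed batch of on- and off-policy samples.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class PolicyNet(nn.Module):
    """Diagonal-Gaussian policy for continuous control (hypothetical architecture)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return Normal(self.mu(obs), self.log_std.exp())


def update_policy(policy, old_policy, obs, act, adv, behavior_logp,
                  max_kl=0.01, penalty_coef=10.0, lr=3e-4, steps=10):
    """One approximate trust-region update.

    `behavior_logp` holds log-probabilities of `act` under whichever policy
    actually generated each sample (current or earlier), so the importance
    ratio corrects the surrogate for off-policy samples.
    """
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    with torch.no_grad():                              # freeze the reference policy
        old_dist = old_policy.dist(obs)

    for _ in range(steps):
        dist = policy.dist(obs)
        logp = dist.log_prob(act).sum(-1)
        ratio = torch.exp(logp - behavior_logp)        # importance weight
        surrogate = (ratio * adv).mean()               # off-policy surrogate objective
        kl = kl_divergence(old_dist, dist).sum(-1).mean()

        # KL-penalized stand-in for the constrained optimization described in the paper.
        loss = -(surrogate - penalty_coef * kl)
        opt.zero_grad()
        loss.backward()
        opt.step()

        if kl.item() > max_kl:                         # crude trust-region check
            break
    return kl.item()
```

In the paper the constrained problem is instead solved approximately under arbitrary parameterization (in the spirit of TRPO's conjugate-gradient step); the penalty-plus-check loop above only mimics that behavior for illustration.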
Pages: 2223-2235
Page count: 13
Related Papers (50 records in total)
  • [1] Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient
    Tosatto, Samuele
    Carvalho, Joao
    Peters, Jan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (10) : 5996 - 6010
  • [2] Off-Policy Deep Reinforcement Learning Based on Steffensen Value Iteration
    Cheng, Yuhu
    Chen, Lin
    Chen, C. L. Philip
    Wang, Xuesong
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2021, 13 (04) : 1023 - 1032
  • [3] A multi-step on-policy deep reinforcement learning method assisted by off-policy policy evaluation
    Zhang, Huaqing
    Ma, Hongbin
    Mersha, Bemnet Wondimagegnehu
    Jin, Ying
    APPLIED INTELLIGENCE, 2024, 54 (21) : 11144 - 11159
  • [4] Efficient Off-Policy Safe Reinforcement Learning Using Trust Region Conditional Value At Risk
    Kim, Dohyeong
    Oh, Songhwai
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7644 - 7651
  • [5] Enhanced Off-Policy Reinforcement Learning With Focused Experience Replay
    Kong, Seung-Hyun
    Nahrendra, I. Made Aswin
    Paek, Dong-Hee
    IEEE ACCESS, 2021, 9 (09) : 93152 - 93164
  • [6] Off-Policy Proximal Policy Optimization
    Meng, Wenjia
    Zheng, Qian
    Pan, Gang
    Yin, Yilong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9162 - 9170
  • [7] Off-policy and on-policy reinforcement learning with the Tsetlin machine
    Gorji, Saeed Rahimi
    Granmo, Ole-Christoffer
    APPLIED INTELLIGENCE, 2023, 53 (08) : 8596 - 8613
  • [8] Reliability assessment of off-policy deep reinforcement learning: A benchmark for aerodynamics
    Berger, Sandrine
    Ramo, Andrea Arroyo
    Guillet, Valentin
    Lahire, Thibault
    Martin, Brice
    Jardin, Thierry
    Rachelson, Emmanuel
    DATA-CENTRIC ENGINEERING, 2024, 5
  • [9] Off-Policy Differentiable Logic Reinforcement Learning
    Zhang, Li
    Li, Xin
    Wang, Mingzhong
    Tian, Andong
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT II, 2021, 12976 : 617 - 632