An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited by: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO;
DOI
10.1109/TNNLS.2020.3044196
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Number
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective in stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, off-policy TRPO, which exploits both on- and off-policy data and guarantees the monotonic improvement of policies. A surrogate objective function is developed that uses both on- and off-policy data while preserving monotonic policy improvement. We then optimize this surrogate objective function by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO achieves better performance on the majority of continuous control tasks than other trust region policy-based methods using off-policy data.
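The two ingredients the abstract combines can be illustrated with a minimal sketch, not the paper's actual algorithm: an importance-sampled surrogate objective that reweights off-policy samples by the probability ratio between the current and the behavior policy, and a trust-region test that accepts a candidate policy only if its mean KL divergence from the old policy stays under a threshold. All numbers below are hypothetical toy data.

```python
import numpy as np

# Toy batch: probability the current policy (pi_new) and the behavior policy
# (pi_behavior) assign to each sampled action, plus advantage estimates.
pi_new = np.array([0.5, 0.4, 0.7, 0.2])
pi_behavior = np.array([0.4, 0.5, 0.6, 0.3])
advantages = np.array([1.0, -0.5, 0.8, 0.2])

# Importance-sampled surrogate objective: E[(pi_new / pi_behavior) * A].
# The ratio corrects for the mismatch between the data-collecting policy
# and the policy being evaluated, which is what lets off-policy data be used.
ratio = pi_new / pi_behavior
surrogate = np.mean(ratio * advantages)

# Trust-region check: accept the candidate policy only if the mean KL
# divergence from the old policy is below a step-size limit delta.
delta = 0.01
p_old = np.array([[0.3, 0.7], [0.6, 0.4]])   # old action distributions per state
p_new = np.array([[0.32, 0.68], [0.58, 0.42]])  # candidate distributions
kl = np.mean(np.sum(p_old * np.log(p_old / p_new), axis=1))
accept = kl <= delta
```

In a full TRPO-style method the surrogate is maximized subject to the KL constraint (typically via a conjugate-gradient step with line search); the sketch above only shows the quantities that constraint and objective are built from.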
Pages: 2223-2235 (13 pages)
Related Papers (50 total)
  • [31] Off-policy evaluation for tabular reinforcement learning with synthetic trajectories
    Wang, Weiwei
    Li, Yuqiang
    Wu, Xianyi
    STATISTICS AND COMPUTING, 2024, 34 (01)
  • [32] Off-Policy Conservative Distributional Reinforcement Learning With Safety Constraints
    Zhang, Hengrui
    Lin, Youfang
    Han, Sheng
    Wang, Shuo
    Lv, Kai
    IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2025, 55 (03): : 2033 - 2045
  • [33] Policy Return: A New Method for Reducing the Number of Experimental Trials in Deep Reinforcement Learning
    Liu, Feng
    Dai, Shuling
    Zhao, Yongjia
    IEEE ACCESS, 2020, 8 : 228099 - 228107
  • [34] Traffic Signal Control Using End-to-End Off-Policy Deep Reinforcement Learning
    Chu, Kai-Fung
    Lam, Albert Y. S.
    Li, Victor O. K.
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2022, 23 (07) : 7184 - 7195
  • [35] Safe Off-Policy Deep Reinforcement Learning Algorithm for Volt-VAR Control in Power Distribution Systems
    Wang, Wei
    Yu, Nanpeng
    Gao, Yuanqi
    Shi, Jie
    IEEE TRANSACTIONS ON SMART GRID, 2020, 11 (04) : 3008 - 3018
  • [36] Re-attentive experience replay in off-policy reinforcement learning
    Wei, Wei
    Wang, Da
    Li, Lin
    Liang, Jiye
    MACHINE LEARNING, 2024, 113 (05) : 2327 - 2349
  • [37] Enhanced Strategies for Off-Policy Reinforcement Learning Algorithms in HVAC Control
    Chen, Zhe
    Jia, Qingshan
    2024 14TH ASIAN CONTROL CONFERENCE, ASCC 2024, 2024, : 1691 - 1696
  • [39] Model-free off-policy reinforcement learning in continuous environment
    Wawrzynski, P
    Pacut, A
    2004 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-4, PROCEEDINGS, 2004, : 1091 - 1096
  • [40] Cautious policy programming: exploiting KL regularization for monotonic policy improvement in reinforcement learning
    Lingwei Zhu
    Takamitsu Matsubara
    Machine Learning, 2023, 112 : 4527 - 4562