An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited by: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO;
DOI
10.1109/TNNLS.2020.3044196
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective at stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, off-policy TRPO, which exploits both on- and off-policy data and guarantees the monotonic improvement of policies. A surrogate objective function is developed to use both on- and off-policy data while preserving the monotonic improvement of policies. We then optimize this surrogate objective function by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO achieves better performance on the majority of continuous control tasks than other trust region policy-based methods that use off-policy data.
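The abstract names two ingredients: a surrogate objective that importance-weights both on- and off-policy samples, and a trust-region (KL-constrained) update solved approximately. The PyTorch sketch below illustrates that general pattern only; it is not the authors' implementation. The network architecture, the backtracking line search standing in for TRPO's conjugate-gradient solver, and names such as `obs`, `acts`, `advs`, and `behavior_logp` are assumptions made for illustration.

```python
# Minimal sketch of a trust-region update on an importance-weighted
# surrogate objective, assuming PyTorch. This is NOT the paper's
# off-policy TRPO; it only illustrates the two ingredients named in
# the abstract. `behavior_logp` (log-probs under whichever policy
# generated each sample) and all shapes are hypothetical.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy for continuous control."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mu(obs), self.log_std.exp())


def surrogate_loss(policy, obs, acts, advs, behavior_logp):
    # Importance weight: current policy vs. the (possibly off-policy)
    # behavior policy that generated each sample.
    logp = policy.dist(obs).log_prob(acts).sum(-1)
    ratio = torch.exp(logp - behavior_logp)
    return -(ratio * advs).mean()  # negated so a minimizer ascends


def mean_kl(policy, old_dist, obs):
    # Average KL from the frozen pre-update policy to the current one.
    return torch.distributions.kl_divergence(
        old_dist, policy.dist(obs)).sum(-1).mean()


def trust_region_step(policy, obs, acts, advs, behavior_logp,
                      max_kl=0.01, step_size=1.0, backtracks=10):
    """Gradient step on the surrogate, shrunk until the KL constraint
    holds. (TRPO proper solves the constrained problem with conjugate
    gradient and Fisher-vector products; a backtracking line search
    keeps this sketch short.)"""
    d = policy.dist(obs)
    old_dist = torch.distributions.Normal(d.loc.detach(), d.scale.detach())
    loss = surrogate_loss(policy, obs, acts, advs, behavior_logp)
    grads = torch.autograd.grad(loss, list(policy.parameters()))
    start = [p.detach().clone() for p in policy.parameters()]
    for i in range(backtracks):
        with torch.no_grad():
            for p, p0, g in zip(policy.parameters(), start, grads):
                p.copy_(p0 - step_size * 0.5 ** i * g)
        if mean_kl(policy, old_dist, obs).item() <= max_kl:
            return  # accepted: improvement step inside the trust region
    with torch.no_grad():  # no feasible step found: restore parameters
        for p, p0 in zip(policy.parameters(), start):
            p.copy_(p0)
```

Fed with batches drawn from both fresh rollouts and a replay buffer, with each transition tagged with the log-probability its behavior policy assigned, repeated calls to trust_region_step approximate the on/off-policy mixing the abstract describes.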
Pages: 2223-2235
Number of pages: 13
Related Papers
50 entries in total
  • [41] Cautious policy programming: exploiting KL regularization for monotonic policy improvement in reinforcement learning
    Zhu, Lingwei
    Matsubara, Takamitsu
    MACHINE LEARNING, 2023, 112 (11) : 4527 - 4562
  • [42] TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning
    Shi, Longxiang
    Li, Shijian
    Cao, Longbing
    Yang, Long
    Pan, Gang
    AAMAS '19: PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS, 2019, : 1025 - 1032
  • [43] Fuzzy state aggregation and off-policy reinforcement learning for stochastic environments
    Wardell, Dean C.
    Peterson, Gilbert L.
    PROCEEDINGS OF THE EIGHTH IASTED INTERNATIONAL CONFERENCE ON CONTROL AND APPLICATIONS, 2006, : 133+
  • [44] Off-Policy Meta-Reinforcement Learning With Belief-Based Task Inference
    Imagawa, Takahisa
    Hiraoka, Takuya
    Tsuruoka, Yoshimasa
    IEEE ACCESS, 2022, 10 : 49494 - 49507
  • [45] An off-policy deep reinforcement learning-based active learning for crime scene investigation image classification
    Zhang, Yixin
    Liu, Yang
    Jiang, Guofan
    Yang, Yuchen
    Zhang, Jian
    Jing, Yang
    Alizadehsani, Roohallah
    Tadeusiewicz, Ryszard
    Plawiak, Pawel
    INFORMATION SCIENCES, 2025, 710
  • [46] Off-policy deep reinforcement learning with automatic entropy adjustment for adaptive online grid emergency control
    Zhang, Ying
    Yue, Meng
    Wang, Jianhui
    ELECTRIC POWER SYSTEMS RESEARCH, 2023, 217
  • [47] An Off-policy maximum entropy deep reinforcement learning method for data-driven secondary frequency control of island microgrid
    Huang, Xiangmin
    Zeng, Jun
    Wang, Tianlun
    Zeng, Shunqi
    APPLIED SOFT COMPUTING, 2025, 170
  • [48] High-Value Prioritized Experience Replay for Off-policy Reinforcement Learning
    Cao, Xi
    Wan, Huaiyu
    Lin, Youfang
    Han, Sheng
    2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019), 2019, : 1510 - 1514
  • [49] A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning
    Patterson, Andrew
    White, Adam
    White, Martha
    JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23
  • [50] Hyperparameter Tuning of an Off-Policy Reinforcement Learning Algorithm for H∞ Tracking Control
    Farahmandi, Alireza
    Reitz, Brian
    Debord, Mark
    Philbrick, Douglas
    Estabridis, Katia
    Hewer, Gary
    LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 211, 2023, 211