An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited by: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO;
DOI
10.1109/TNNLS.2020.3044196
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective at stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, off-policy TRPO, which exploits both on- and off-policy data and guarantees monotonic policy improvement. A surrogate objective function is developed that uses both on- and off-policy data while preserving the monotonic improvement guarantee. We then optimize this surrogate objective by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO achieves better performance on the majority of continuous control tasks than other trust-region policy-based methods that use off-policy data.
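As a minimal illustration of the trust-region machinery the abstract refers to (a generic sketch of standard TRPO ingredients, not the authors' off-policy implementation), the surrogate objective reweights advantages by an importance-sampling ratio between the new and old policies, and a mean KL-divergence bound defines the trust region:

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, advantages):
    """Importance-weighted surrogate: L = E[ pi_new/pi_old * A ]."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(ratio * np.asarray(advantages)))

def mean_kl(p_old, p_new):
    """Mean KL(p_old || p_new) over a batch of discrete action distributions."""
    p_old, p_new = np.asarray(p_old), np.asarray(p_new)
    return float(np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)))

def within_trust_region(p_old, p_new, delta=0.01):
    """Accept a candidate policy only if its mean KL from the old policy is <= delta."""
    return mean_kl(p_old, p_new) <= delta

# Toy check: when the candidate equals the old policy, every ratio is 1,
# so the surrogate reduces to the mean advantage and the KL distance is 0.
logp = np.log(np.array([0.5, 0.25, 0.25]))
adv = np.array([1.0, -0.5, 0.5])
print(surrogate_objective(logp, logp, adv))
```

In full TRPO this constrained maximization is solved approximately (e.g., via a conjugate-gradient step on the Fisher-vector product followed by a line search); the sketch above only shows the two quantities being traded off.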
Pages: 2223 - 2235 (13 pages)
Related Papers (50 records)
  • [11] Zhang, Huihui; Han, Xu. Off-policy asymptotic and adaptive maximum entropy deep reinforcement learning. International Journal of Machine Learning and Cybernetics, 2025, 16(04): 2417-2429.
  • [12] Wang, Jie; Gao, Rui; Zha, Hongyuan. Reliable Off-Policy Evaluation for Reinforcement Learning. Operations Research, 2024, 72(02): 699-716.
  • [13] Miao, Dadong; Wang, Yanan; Tang, Guoyu; Liu, Lin; Xu, Sulong; Long, Bo; Xiao, Yun; Wu, Lingfei; Jiang, Yunjiang. Sequential Search with Off-Policy Reinforcement Learning. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM 2021), 2021: 4006-4015.
  • [14] Tan, Xiaoyu; Qu, Chao; Xiong, Junwu; Zhang, James; Qiu, Xihe; Jin, Yaochu. Model-Based Off-Policy Deep Reinforcement Learning With Model-Embedding. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024, 8(04): 2974-2986.
  • [15] Park, Bumgeun; Kim, Taeyoung; Moon, Woohyeon; Nengroo, Sarvar Hussain; Har, Dongsoo. Off-Policy Reinforcement Learning with Loss Function Weighted by Temporal Difference Error. Advanced Intelligent Computing Technology and Applications (ICIC 2023), Pt V, 2023, 14090: 600-613.
  • [16] Garg, Shaswat; Masnavi, Houman; Fidan, Baris; Janabi-Sharifi, Farrokh; Mantegh, Iraj. Benchmarking Off-Policy Deep Reinforcement Learning Algorithms for UAV Path Planning. 2024 International Conference on Unmanned Aircraft Systems (ICUAS), 2024: 317-323.
  • [17] Gurumurthy, Swaminathan; Kolter, J. Zico; Manchester, Zachary. Deep Off-Policy Iterative Learning Control. Learning for Dynamics and Control Conference, Vol 211, 2023.
  • [18] Lee, Sunbowen; Gong, Yicheng; Deng, Chao. Counterfactual experience augmented off-policy reinforcement learning. Neurocomputing, 2025, 637.
  • [19] Yang, Yana; Xi, Meng; Dai, Huiao; Wen, Jiabao; Yang, Jiachen. Z-Score Experience Replay in Off-Policy Deep Reinforcement Learning. Sensors, 2024, 24(23).
  • [20] Yu, Jiayu; Li, Jingyao; Lu, Shuai; Han, Shuai. Mixed experience sampling for off-policy reinforcement learning. Expert Systems with Applications, 2024, 251.