An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited by: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO;
DOI
10.1109/TNNLS.2020.3044196
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective in stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, termed off-policy TRPO, which exploits both on- and off-policy data and guarantees the monotonic improvement of policies. A surrogate objective function is developed that uses both on- and off-policy data while preserving the monotonic improvement of policies. We then optimize this surrogate objective function by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO outperforms other trust-region policy-based methods that use off-policy data on the majority of continuous control tasks.
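The abstract describes a surrogate objective built from both on- and off-policy samples and optimized under a trust-region constraint. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the authors' implementation: importance ratios correct the off-policy samples, and the exact constrained step from the paper is approximated here by a KL penalty plus a KL-based stopping check. All names (PolicyNet, update_policy, max_kl, penalty_coef) and hyperparameter values are illustrative assumptions.

```python
# Minimal sketch (assumption-laden, not the authors' code) of a trust-region-style
# policy update on a mixed batch of on- and off-policy samples.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class PolicyNet(nn.Module):
    """Diagonal-Gaussian policy for continuous control (hypothetical architecture)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return Normal(self.mu(obs), self.log_std.exp())


def update_policy(policy, old_policy, obs, act, adv, behavior_logp,
                  max_kl=0.01, penalty_coef=10.0, lr=3e-4, steps=10):
    """One approximate trust-region update.

    `behavior_logp` holds log-probabilities of `act` under whichever policy
    actually generated each sample (current or earlier), so the importance
    ratio corrects the surrogate for off-policy samples.
    """
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    with torch.no_grad():                              # freeze the reference policy
        old_dist = old_policy.dist(obs)

    for _ in range(steps):
        dist = policy.dist(obs)
        logp = dist.log_prob(act).sum(-1)
        ratio = torch.exp(logp - behavior_logp)        # importance weight
        surrogate = (ratio * adv).mean()               # off-policy surrogate objective
        kl = kl_divergence(old_dist, dist).sum(-1).mean()

        # KL-penalized stand-in for the constrained optimization described in the paper.
        loss = -(surrogate - penalty_coef * kl)
        opt.zero_grad()
        loss.backward()
        opt.step()

        if kl.item() > max_kl:                         # crude trust-region check
            break
    return kl.item()
```

In the paper the constrained problem is instead solved approximately under arbitrary parameterization (in the spirit of TRPO's conjugate-gradient step); the penalty-plus-check loop above only mimics that behavior for illustration.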
Pages: 2223-2235
Page count: 13
Related Papers (50 records in total)
  • [1] Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient
    Tosatto, Samuele
    Carvalho, Joao
    Peters, Jan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (10) : 5996 - 6010
  • [2] Off-Policy Deep Reinforcement Learning Based on Steffensen Value Iteration
    Cheng, Yuhu
    Chen, Lin
    Chen, C. L. Philip
    Wang, Xuesong
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2021, 13 (04) : 1023 - 1032
  • [3] A multi-step on-policy deep reinforcement learning method assisted by off-policy policy evaluation
    Zhang, Huaqing
    Ma, Hongbin
    Mersha, Bemnet Wondimagegnehu
    Jin, Ying
    APPLIED INTELLIGENCE, 2024, 54 (21) : 11144 - 11159
  • [4] Efficient Off-Policy Safe Reinforcement Learning Using Trust Region Conditional Value At Risk
    Kim, Dohyeong
    Oh, Songhwai
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7644 - 7651
  • [5] Enhanced Off-Policy Reinforcement Learning With Focused Experience Replay
    Kong, Seung-Hyun
    Nahrendra, I. Made Aswin
    Paek, Dong-Hee
    IEEE ACCESS, 2021, 9 (09) : 93152 - 93164
  • [6] Off-Policy Proximal Policy Optimization
    Meng, Wenjia
    Zheng, Qian
    Pan, Gang
    Yin, Yilong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9162 - 9170
  • [7] Off-policy and on-policy reinforcement learning with the Tsetlin machine
    Gorji, Saeed Rahimi
    Granmo, Ole-Christoffer
    APPLIED INTELLIGENCE, 2023, 53 (08) : 8596 - 8613
  • [8] Reliability assessment of off-policy deep reinforcement learning: A benchmark for aerodynamics
    Berger, Sandrine
    Ramo, Andrea Arroyo
    Guillet, Valentin
    Lahire, Thibault
    Martin, Brice
    Jardin, Thierry
    Rachelson, Emmanuel
    DATA-CENTRIC ENGINEERING, 2024, 5
  • [9] Off-Policy Differentiable Logic Reinforcement Learning
    Zhang, Li
    Li, Xin
    Wang, Mingzhong
    Tian, Andong
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT II, 2021, 12976 : 617 - 632