An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO
DOI
10.1109/TNNLS.2020.3044196
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective in stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method that exploits both on- and off-policy data while guaranteeing the monotonic improvement of policies. A surrogate objective function is developed to incorporate both on- and off-policy data and to preserve the monotonic improvement guarantee. We then optimize this surrogate objective function by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO achieves better performance on the majority of continuous control tasks than other trust region policy-based methods that use off-policy data.
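For context on the method summarized above: standard on-policy TRPO updates the policy parameters theta by approximately solving a KL-constrained surrogate maximization, written below in LaTeX. The second form is only a sketch of how off-policy data are commonly folded in via importance sampling from a behavior policy; the behavior policy \beta and trust-region radius \delta are illustrative assumptions here, not notation from the article, whose exact surrogate objective differs.

    % Standard on-policy TRPO step (Schulman et al., 2015):
    \max_{\theta}\; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}
        \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s, a) \right]
    \quad \text{s.t.} \quad
    \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s)\big) \right] \le \delta

    % Hypothetical off-policy variant (assumption, for illustration only):
    % sample states and actions from a behavior policy \beta and reweight accordingly.
    \max_{\theta}\; \mathbb{E}_{s \sim \rho_{\beta},\, a \sim \beta}
        \left[ \frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s, a) \right]
    \quad \text{s.t.} \quad
    \mathbb{E}_{s \sim \rho_{\beta}}\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s)\big) \right] \le \delta

In practice, such a constrained problem is solved approximately by linearizing the objective and taking a quadratic model of the KL term, i.e., a natural-gradient step computed with conjugate gradient followed by a backtracking line search to enforce the constraint.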
Pages: 2223-2235
Page count: 13