An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

Cited: 39
Authors
Meng, Wenjia [1 ]
Zheng, Qian [2 ]
Shi, Yue [1 ]
Pan, Gang [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Nanyang Technol Univ, ROSE Lab, Singapore 637553, Singapore
[3] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Peoples R China
Keywords
Linear programming; TV; Reinforcement learning; Task analysis; Standards; Space stations; Optimization methods; Deep reinforcement learning; off-policy data; policy-based method; trust region; GAME; GO
DOI
10.1109/TNNLS.2020.3044196
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective in stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method that exploits both on- and off-policy data while guaranteeing the monotonic improvement of policies. A surrogate objective function is developed to incorporate both on- and off-policy data and to preserve the monotonic improvement guarantee. We then optimize this surrogate objective function by approximately solving a constrained optimization problem under arbitrary parameterization and finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO achieves better performance on the majority of continuous control tasks than other trust region policy-based methods that use off-policy data.
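For context on the method summarized above: standard on-policy TRPO updates the policy parameters theta by approximately solving a KL-constrained surrogate maximization, written below in LaTeX. The second form is only a sketch of how off-policy data are commonly folded in via importance sampling from a behavior policy; the behavior policy \beta and trust-region radius \delta are illustrative assumptions here, not notation from the article, whose exact surrogate objective differs.

    % Standard on-policy TRPO step (Schulman et al., 2015):
    \max_{\theta}\; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}
        \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s, a) \right]
    \quad \text{s.t.} \quad
    \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s)\big) \right] \le \delta

    % Hypothetical off-policy variant (assumption, for illustration only):
    % sample states and actions from a behavior policy \beta and reweight accordingly.
    \max_{\theta}\; \mathbb{E}_{s \sim \rho_{\beta},\, a \sim \beta}
        \left[ \frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s, a) \right]
    \quad \text{s.t.} \quad
    \mathbb{E}_{s \sim \rho_{\beta}}\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s)\big) \right] \le \delta

In practice, such a constrained problem is solved approximately by linearizing the objective and taking a quadratic model of the KL term, i.e., a natural-gradient step computed with conjugate gradient followed by a backtracking line search to enforce the constraint.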
Pages: 2223-2235
Page count: 13