Authentic Boundary Proximal Policy Optimization

Cited by: 28
Authors
Cheng, Yuhu [1 ,2 ]
Huang, Longyang [1 ,2 ]
Wang, Xuesong [1 ,2 ]
Affiliations
[1] China Univ Min & Technol, Engn Res Ctr Intelligent Control Underground Spac, Minist Educ, Xuzhou 221116, Jiangsu, Peoples R China
[2] China Univ Min & Technol, Sch Informat & Control Engn, Xuzhou 221116, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Linear programming; Optimization; Robots; Games; Reinforcement learning; Neural networks; Authentic boundary; penalized point policy difference; proximal policy optimization (PPO); reinforcement learning (RL); rollback clipping; REINFORCEMENT; SYSTEMS;
DOI
10.1109/TCYB.2021.3051456
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
In recent years, the proximal policy optimization (PPO) algorithm has received considerable attention because of its excellent performance in many challenging tasks. However, the mechanism of PPO's horizontal clipping operation, a key means of improving PPO's performance, still lacks a thorough theoretical explanation. In addition, although PPO is inspired by the learning theory of trust region policy optimization (TRPO), the theoretical connection between PPO's clipping operation and TRPO's trust-region constraint has not been well studied. In this article, we first analyze the effect of PPO's clipping operation on the objective function of conservative policy iteration and rigorously establish the theoretical relationship between PPO and TRPO. We then propose a novel first-order policy-gradient algorithm, authentic boundary PPO (ABPPO), which is based on an authentic boundary setting rule. To better keep the difference between the new and old policies within the clipping range, we build on ABPPO and propose two improved PPO algorithms: rollback mechanism-based ABPPO (RMABPPO) and penalized point policy difference-based ABPPO (P3DABPPO), which adopt rollback clipping and a penalized point policy difference, respectively. Experiments on continuous robotic control tasks in MuJoCo show that, compared with the original PPO, the proposed algorithms effectively improve learning stability and accelerate learning.
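The abstract centers on PPO's clipped surrogate objective and on replacing the flat clipped region with a rollback mechanism. Below is a minimal NumPy sketch of the standard PPO-Clip loss alongside a rollback-style variant in the spirit the abstract describes; the rollback coefficient `alpha` and the exact rollback form are illustrative assumptions, not the paper's RMABPPO.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO-Clip surrogate (to be minimized):
    -E[ min(r * A, clip(r, 1 - eps, 1 + eps) * A) ]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

def rollback_clip_loss(ratio, advantage, eps=0.2, alpha=0.3):
    """Rollback-style surrogate (illustrative sketch, not the paper's RMABPPO).
    Inside [1 - eps, 1 + eps] the branch equals r; outside, it slopes back with
    coefficient -alpha, actively penalizing ratios that drift past the clipping
    boundary instead of merely zeroing the gradient there."""
    # f(r) = r inside the range; f(r) = -alpha * r + (1 + alpha)(1 +/- eps) outside
    rollback = -alpha * ratio + (1.0 + alpha) * np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -np.mean(np.minimum(ratio * advantage, rollback * advantage))

# Toy check: with positive advantages, a ratio beyond 1 + eps lowers the
# rollback branch (is penalized), while plain clipping only flattens it.
ratio = np.array([0.7, 1.0, 1.4])      # pi_new / pi_old per sample
advantage = np.array([1.0, 1.0, 1.0])  # advantage estimates
print(ppo_clip_loss(ratio, advantage))       # clipped branch is flat past 1.2
print(rollback_clip_loss(ratio, advantage))  # rollback branch decreases past 1.2
```

The `min` with the unclipped term preserves PPO's pessimistic lower bound; the rollback branch is what supplies a nonzero restoring gradient once the ratio leaves the clipping range.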
Pages: 9428 - 9438
Number of pages: 11