Authentic Boundary Proximal Policy Optimization

Cited by: 28
Authors
Cheng, Yuhu [1 ,2 ]
Huang, Longyang [1 ,2 ]
Wang, Xuesong [1 ,2 ]
Affiliations
[1] China Univ Min & Technol, Engn Res Ctr Intelligent Control Underground Spac, Minist Educ, Xuzhou 221116, Jiangsu, Peoples R China
[2] China Univ Min & Technol, Sch Informat & Control Engn, Xuzhou 221116, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Linear programming; Optimization; Robots; Games; Reinforcement learning; Neural networks; Authentic boundary; penalized point policy difference; proximal policy optimization (PPO); reinforcement learning (RL); rollback clipping; REINFORCEMENT; SYSTEMS;
DOI
10.1109/TCYB.2021.3051456
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
In recent years, the proximal policy optimization (PPO) algorithm has received considerable attention because of its excellent performance on many challenging tasks. However, the mechanism of PPO's horizontal clipping operation, a key means of improving its performance, still lacks a thorough theoretical explanation. In addition, although PPO is inspired by the learning theory of trust region policy optimization (TRPO), the theoretical connection between PPO's clipping operation and TRPO's trust-region constraint has not been well studied. In this article, we first analyze the effect of PPO's clipping operation on the objective function of conservative policy iteration and rigorously establish the theoretical relationship between PPO and TRPO. We then propose a novel first-order policy gradient algorithm, authentic boundary PPO (ABPPO), which is based on an authentic boundary setting rule. To better keep the difference between the new and old policies within the clipping range, we build on ABPPO and propose two improved PPO algorithms: rollback mechanism-based ABPPO (RMABPPO) and penalized point policy difference-based ABPPO (P3DABPPO), which rest on the ideas of rollback clipping and penalized point policy difference, respectively. Experiments on continuous robotic control tasks implemented in MuJoCo show that, compared with the original PPO, the proposed algorithms improve learning stability and accelerate learning.
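To make the clipping and rollback ideas concrete, the following NumPy sketch contrasts the standard PPO clipped surrogate with a rollback-clipped variant in the spirit of RMABPPO. This is a minimal illustration, not the paper's method: the rollback slope alpha and the piecewise form outside the clipping band are assumptions; the paper's authentic boundary setting rule may define the boundary behavior differently.

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate (Schulman et al., 2017).

    ratio: per-sample probability ratio pi_new(a|s) / pi_old(a|s)
    adv:   per-sample advantage estimates
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Pessimistic (lower) bound on the surrogate objective.
    return np.minimum(unclipped, clipped).mean()

def rollback_clip_objective(ratio, adv, eps=0.2, alpha=0.3):
    """Hypothetical rollback-clipped surrogate (sketch, not the paper's rule).

    Where PPO-clip goes flat (zero gradient) outside [1-eps, 1+eps],
    the rollback idea substitutes a negatively sloped linear term, so the
    gradient actively pushes the ratio back into the clipping range.
    Both branches are continuous at the band boundaries; alpha is an
    assumed rollback-slope hyperparameter.
    """
    upper = (-alpha * ratio + (1.0 + alpha) * (1.0 + eps)) * adv  # r > 1+eps
    lower = (-alpha * ratio + (1.0 + alpha) * (1.0 - eps)) * adv  # r < 1-eps
    # Apply rollback only on the side where PPO-clip would bind.
    pos = np.where(ratio > 1.0 + eps, upper, ratio * adv)  # adv >= 0 case
    neg = np.where(ratio < 1.0 - eps, lower, ratio * adv)  # adv <  0 case
    return np.where(adv >= 0.0, pos, neg).mean()

# Usage on synthetic data: ratios near 1 and random advantages.
rng = np.random.default_rng(0)
ratio = rng.uniform(0.5, 1.5, size=256)
adv = rng.normal(size=256)
print("PPO-clip objective:     ", ppo_clip_objective(ratio, adv))
print("Rollback-clip objective:", rollback_clip_objective(ratio, adv))
```

The design point the sketch highlights is the one the abstract argues: plain clipping merely zeroes the gradient once the ratio leaves the trust band, whereas a rollback term penalizes excursions, helping keep the new-to-old policy difference inside the clipping range.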
Pages: 9428-9438
Number of pages: 11
Related Papers
50 records in total
  • [31] Robust solar sail trajectories using proximal policy optimization
    Bianchi, Christian
    Niccolai, Lorenzo
    Mengali, Giovanni
    ACTA ASTRONAUTICA, 2025, 226 : 702 - 715
  • [32] Intelligent Control of a Quadrotor with Proximal Policy Optimization Reinforcement Learning
    Lopes, Guilherme Cano
    Ferreira, Murillo
    Simoes, Alexandre da Silva
    Colombini, Esther Luna
    15TH LATIN AMERICAN ROBOTICS SYMPOSIUM 6TH BRAZILIAN ROBOTICS SYMPOSIUM 9TH WORKSHOP ON ROBOTICS IN EDUCATION (LARS/SBR/WRE 2018), 2018, : 503 - 508
  • [33] An Empirical Investigation of Early Stopping Optimizations in Proximal Policy Optimization
    Dossa, Rousslan Fernand Julien
    Huang, Shengyi
    Ontanon, Santiago
    Matsubara, Takashi
    IEEE ACCESS, 2021, 9 : 117981 - 117992
  • [34] Entropy adjustment by interpolation for exploration in Proximal Policy Optimization (PPO)
    Boudlal, Ayoub
    Khafaji, Abderahim
    Elabbadi, Jamal
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 133
  • [35] Implementing action mask in proximal policy optimization (PPO) algorithm
    Tang, Cheng-Yen
    Liu, Chien-Hung
    Chen, Woei-Kae
    You, Shingchern D.
    ICT EXPRESS, 2020, 6 (03): : 200 - 203
  • [36] Control of conventional continuous thickeners via proximal policy optimization
    Silva, Jonathan R.
    Euzebio, Thiago A. M.
    Braga, Marcio F.
    MINERALS ENGINEERING, 2024, 214
  • [37] Automated cloud resources provisioning with the use of the proximal policy optimization
    Funika, Włodzimierz
    Koperek, Paweł
    Kitowski, Jacek
    THE JOURNAL OF SUPERCOMPUTING, 2023, 79 : 6674 - 6704
  • [38] An adversarial twin-agent inverse proximal policy optimization guided by model predictive control
    Gupta, Nikita
    Kandath, Harikumar
    Kodamana, Hariprasad
    COMPUTERS & CHEMICAL ENGINEERING, 2025, 199
  • [39] Relative Entropy of Correct Proximal Policy Optimization Algorithms with Modified Penalty Factor in Complex Environment
    Chen, Weimin
    Wong, Kelvin Kian Loong
    Long, Sifan
    Sun, Zhili
    ENTROPY, 2022, 24 (04)
  • [40] Optimization of cobalt oxalate synthesis process based on modified proximal policy optimization algorithm
    Jia R.-D.
    Ning W.-B.
    He D.-K.
    Chu F.
    Wang F.-L.
    Kongzhi yu Juece/Control and Decision, 2023, 38 (11): : 3075 - 3082