Proximal Policy Optimization with Mixed Distributed Training

Cited by: 16
Authors
Zhang, Zhenyu [2 ]
Luo, Xiangfeng [1 ,2 ]
Liu, Tong [2 ]
Xie, Shaorong [2 ]
Wang, Jianshu [2 ]
Wang, Wei [2 ]
Li, Yang [2 ]
Peng, Yan [2 ]
Affiliations
[1] Shanghai Univ, Shanghai Inst Adv Commun & Data Sci, Shanghai, Peoples R China
[2] Shanghai Univ, Sch Comp Engn & Sci, Shanghai, Peoples R China
Source
2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019) | 2019
Funding
National Natural Science Foundation of China;
Keywords
machine learning; reinforcement learning; distributed system;
DOI
10.1109/ICTAI.2019.00206
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Instability and slowness are two main problems in deep reinforcement learning. Although proximal policy optimization (PPO) is the state of the art, it still suffers from both. We introduce mixed distributed proximal policy optimization (MDPPO), an improved algorithm based on PPO, and show that it accelerates and stabilizes the training process. In MDPPO, multiple different policies are trained simultaneously, and each of them controls several identical agents that interact with environments. Actions are sampled by each policy separately as usual, but the trajectories used for training are collected from all agents rather than from a single policy. We find that elaborately choosing some auxiliary trajectories to train the policies makes the algorithm more stable and faster to converge, especially in environments with sparse rewards.
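The core idea the abstract describes, several policies collecting experience in parallel and each policy training on the pooled ("mixed") batch with the PPO clipped objective, can be sketched on a toy bandit problem. This is a minimal illustrative sketch, not the authors' implementation: the `BernoulliPolicy` class, the toy reward, and all function names are hypothetical, and the advantage is simply reward minus a fixed baseline.

```python
import random
import math

class BernoulliPolicy:
    """Hypothetical one-parameter policy over two actions: p(a=1) = sigmoid(theta)."""
    def __init__(self, theta):
        self.theta = theta

    def prob(self, action):
        p1 = 1.0 / (1.0 + math.exp(-self.theta))
        return p1 if action == 1 else 1.0 - p1

    def sample(self):
        return 1 if random.random() < self.prob(1) else 0

def collect(policy, n_agents, steps):
    """Each of the policy's agents interacts with a toy bandit: action 1 yields reward 1."""
    traj = []
    for _ in range(n_agents * steps):
        a = policy.sample()
        r = float(a)
        traj.append((a, r, policy.prob(a)))  # store the old probability for the ratio
    return traj

def ppo_update(policy, mixed_batch, lr=0.1, clip=0.2):
    """One gradient step of the PPO clipped surrogate on the mixed batch."""
    grad = 0.0
    for a, adv, old_p in mixed_batch:
        ratio = policy.prob(a) / old_p
        # Clipping: samples outside the trust region contribute no gradient.
        if (adv > 0 and ratio > 1 + clip) or (adv < 0 and ratio < 1 - clip):
            continue
        dlogp = a - policy.prob(1)  # d/dtheta log p(a) for the sigmoid-Bernoulli policy
        grad += ratio * adv * dlogp
    policy.theta += lr * grad / len(mixed_batch)

random.seed(0)
policies = [BernoulliPolicy(theta) for theta in (-0.5, 0.0, 0.5)]
for _ in range(20):
    # Mixed collection: pool trajectories from every policy's agents...
    batch = []
    for pi in policies:
        for a, r, old_p in collect(pi, n_agents=4, steps=8):
            batch.append((a, r - 0.5, old_p))  # reward minus a baseline as the advantage
    # ...then every policy trains on the shared mixed batch, not just its own data.
    for pi in policies:
        ppo_update(pi, batch)

print([round(pi.prob(1), 2) for pi in policies])
```

The difference from vanilla distributed PPO is confined to the last loop: each policy's update consumes the union of all policies' trajectories, which is the mixing the paper credits for faster, more stable convergence under sparse rewards.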
Pages: 1452-1456
Page count: 5