A new noise network and gradient parallelisation-based asynchronous advantage actor-critic algorithm

Cited by: 1
Authors
Fei, Zhengshun [1 ]
Wang, Yanping [1 ]
Wang, Jinglong [1 ]
Liu, Kangling [2 ]
Huang, Bingqiang [1 ]
Tan, Ping [1 ]
Affiliations
[1] Zhejiang Univ Sci & Technol, Prov Key Inst Robot, Sch Automat & Elect Engn, Hangzhou, Peoples R China
[2] Zhejiang Univ, Coll Control Sci & Engn, State Key Lab Ind Control Technol, Hangzhou, Peoples R China
Keywords
asynchronous advantage actor-critic (A3C); generalised advantage estimation (GAE); parallelisation; reinforcement learning
DOI
10.1049/csy2.12059
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The asynchronous advantage actor-critic (A3C) algorithm is a widely used policy optimisation algorithm in reinforcement learning, in which 'asynchronous' refers to parallel interactive sampling and training across worker processes, and 'advantage' refers to a multi-step reward estimate used to weight policy updates. To address the low efficiency and insufficient convergence caused by the traditional heuristic exploration of the A3C algorithm, an improved A3C algorithm is proposed in this paper. In this algorithm, a noise network function that updates the noise tensor in an explicit way is constructed to train the agent, and generalised advantage estimation (GAE) is adopted to compute the advantage function. Finally, a new mean gradient parallelisation method is designed to update the parameters of both the primary and secondary networks: the gradients passed from all sub-processes to the main process are summed and averaged before being applied. Simulation experiments were conducted in a Gym environment using the PyTorch Agent Net (PTAN) reinforcement learning library, and the results show that the method enables the agent to complete training faster and to converge more stably during training. The improved A3C algorithm performs better than the original algorithm and can provide new ideas for subsequent research on reinforcement learning algorithms.
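The abstract describes two quantitative components, generalised advantage estimation and averaging of the gradients sent by worker sub-processes to the main process. The sketch below illustrates both in a PyTorch setting; it is a minimal, hypothetical illustration, not the authors' published implementation, and the function names, hyperparameters (gamma, lam) and the gradient-passing interface are assumptions.

```python
# Hypothetical sketch (not the authors' code) of two components described
# in the abstract: generalised advantage estimation (GAE) and mean-gradient
# aggregation of the gradients that worker sub-processes send to the main process.
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one rollout.

    rewards, dones -- per-step rewards and episode-termination flags (length T)
    values         -- critic estimates for steps 0..T (bootstrap value last)
    """
    advantages, running = [], 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages.append(running)
    advantages.reverse()
    return torch.tensor(advantages)

def apply_mean_gradients(global_net, optimiser, worker_grads):
    """Average the per-parameter gradients collected from all sub-processes
    and apply a single optimiser step to the shared (global) network.

    worker_grads -- one list of gradient tensors per worker, ordered like
                    global_net.parameters()
    """
    optimiser.zero_grad()
    n = len(worker_grads)
    for param, grads in zip(global_net.parameters(), zip(*worker_grads)):
        param.grad = sum(grads) / n  # mean over all workers
    optimiser.step()
```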
Pages: 175-188
Page count: 14