Towards Understanding Asynchronous Advantage Actor-Critic: Convergence and Linear Speedup

Cited: 18
Authors
Shen, Han [1 ]
Zhang, Kaiqing [2 ,3 ]
Hong, Mingyi [4 ]
Chen, Tianyi [1 ]
Affiliations
[1] Rensselaer Polytech Inst, Dept Elect Comp & Syst Engn, Troy, NY 12180 USA
[2] MIT, Lab Informat & Decis Syst & Comp Sci, Cambridge, MA 02139 USA
[3] MIT, Artificial Intelligence Lab, Cambridge, MA 02139 USA
[4] Univ Minnesota, Dept Elect & Comp Engn, Minneapolis, MN 55455 USA
Keywords
Reinforcement learning; policy gradient; actor critic; asynchronous parallel method; REINFORCEMENT; TIME; OPTIMIZATION; ALGORITHMS;
DOI
10.1109/TSP.2023.3268475
CLC classification
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline codes
0808; 0809;
Abstract
Asynchronous and parallel implementation of standard reinforcement learning (RL) algorithms is a key enabler of the tremendous success of modern RL. Among the many asynchronous RL algorithms, arguably the most popular and effective one is the asynchronous advantage actor-critic (A3C) algorithm. Although A3C has become a workhorse of RL, its theoretical properties are still not well understood, including its non-asymptotic analysis and the performance gain of parallelism (a.k.a. linear speedup). This paper revisits the A3C algorithm and establishes its non-asymptotic convergence guarantees. Under both i.i.d. and Markovian sampling, we establish a local convergence guarantee for A3C with general policy approximation and a global convergence guarantee under softmax policy parameterization. Under i.i.d. sampling, A3C obtains a sample complexity of O(ε^{-2.5}/N) per worker to achieve ε-accuracy, where N is the number of workers. Compared to the best-known sample complexity of O(ε^{-2.5}) for two-timescale AC, A3C achieves linear speedup, which theoretically justifies the advantage of parallelism and asynchrony in AC algorithms for the first time. Numerical tests on a synthetic environment, OpenAI Gym environments, and Atari games are provided to verify our theoretical analysis.
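To make the structure discussed in the abstract concrete, below is a minimal, hypothetical sketch of an A3C-style asynchronous actor-critic loop: several workers interact with their own copies of a small synthetic tabular MDP and apply lock-free (Hogwild-style) updates to shared actor and critic parameters, with the TD error serving as the advantage estimate. The environment, step sizes, and tabular parameterization are illustrative assumptions, not the paper's exact algorithm or experimental setup.

```python
"""Illustrative A3C-style sketch (assumptions: tiny synthetic tabular MDP,
softmax policy logits per (state, action), tabular critic, lock-free updates).
Python threads are not truly parallel under the GIL; this only shows the
asynchronous update structure."""
import threading
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
ALPHA, BETA = 0.05, 0.1            # actor / critic step sizes (two timescales)
STEPS_PER_WORKER, N_WORKERS = 2000, 4

# Shared parameters, updated asynchronously by all workers without locks.
theta = np.zeros((N_STATES, N_ACTIONS))   # actor: softmax policy logits
v = np.zeros(N_STATES)                     # critic: tabular value estimates

rng_env = np.random.default_rng(0)
P = rng_env.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # transition kernel
R = rng_env.uniform(size=(N_STATES, N_ACTIONS))                        # reward table

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def worker(seed):
    rng = np.random.default_rng(seed)
    s = int(rng.integers(N_STATES))
    for _ in range(STEPS_PER_WORKER):
        # Read the (possibly stale) shared parameters and act.
        pi = softmax(theta[s])
        a = int(rng.choice(N_ACTIONS, p=pi))
        s_next = int(rng.choice(N_STATES, p=P[s, a]))
        r = R[s, a]
        # TD error doubles as the advantage estimate.
        delta = r + GAMMA * v[s_next] - v[s]
        # Critic update (faster timescale).
        v[s] += BETA * delta
        # Actor update: policy-gradient step weighted by the advantage.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta[s] += ALPHA * delta * grad_log_pi
        s = s_next

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned values:", np.round(v, 2))
```

The paper's linear-speedup claim says that, per worker, the sample complexity shrinks as O(ε^{-2.5}/N), so adding workers in a loop like the one above reduces each worker's workload proportionally (up to the analysis's conditions on staleness and step sizes).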
Pages: 2579-2594
Page count: 16