Faster Non-asymptotic Convergence for Double Q-learning

Cited by: 0
Authors
Zhao, Lin [1 ]
Xiong, Huaqing [2 ]
Liang, Yingbin [2 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Ohio State Univ, Columbus, OH 43210 USA
Source
Advances in Neural Information Processing Systems 34 (NeurIPS 2021) | 2021, Vol. 34
Funding
U.S. National Science Foundation
Keywords
Stochastic approximation; Bounds
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Double Q-learning (Hasselt, 2010) has gained significant success in practice due to its effectiveness in overcoming the overestimation issue of Q-learning. However, the theoretical understanding of double Q-learning remains rather limited. The only existing finite-time analysis was recently established in (Xiong et al., 2020), where the polynomial learning rate adopted in the analysis typically yields a slower convergence rate. This paper tackles the more challenging case of a constant learning rate, and develops new analytical tools that improve the existing convergence rate by orders of magnitude. Specifically, we show that synchronous double Q-learning attains an $\epsilon$-accurate global optimum with a time complexity of $\tilde{\Omega}\left(\frac{\ln D}{(1-\gamma)^{7}\epsilon^{2}}\right)$, and the asynchronous algorithm achieves a time complexity of $\tilde{\Omega}\left(\frac{L}{(1-\gamma)^{7}\epsilon^{2}}\right)$, where $D$ is the cardinality of the state-action space, $\gamma$ is the discount factor, and $L$ is a parameter related to the sampling strategy for asynchronous double Q-learning. These results improve the existing convergence rate by orders of magnitude in terms of its dependence on all major parameters ($\epsilon$, $1-\gamma$, $D$, $L$). This paper presents a substantial step toward a full understanding of the fast convergence of double Q-learning.
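As a concrete illustration of the algorithm analyzed in the abstract, the following is a minimal Python sketch of tabular synchronous double Q-learning (Hasselt, 2010) with a constant learning rate, the setting studied in this paper. The generative-model sampler sample_next_state, the reward table reward, and the toy MDP in the usage example are illustrative assumptions and are not taken from the paper.

import numpy as np

def synchronous_double_q(sample_next_state, reward, num_states, num_actions,
                         gamma=0.99, alpha=0.1, num_iters=5000, seed=0):
    """Tabular synchronous double Q-learning with a constant step size alpha.
    sample_next_state(s, a) and reward[s, a] are a hypothetical generative-model
    interface assumed for this sketch."""
    rng = np.random.default_rng(seed)
    qa = np.zeros((num_states, num_actions))
    qb = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        # Synchronous setting: every (s, a) pair is updated with a fresh sample.
        for s in range(num_states):
            for a in range(num_actions):
                s_next = sample_next_state(s, a)
                if rng.random() < 0.5:
                    # Update Q^A: next action chosen by Q^A, evaluated by Q^B.
                    a_star = int(np.argmax(qa[s_next]))
                    target = reward[s, a] + gamma * qb[s_next, a_star]
                    qa[s, a] += alpha * (target - qa[s, a])
                else:
                    # Update Q^B: next action chosen by Q^B, evaluated by Q^A.
                    b_star = int(np.argmax(qb[s_next]))
                    target = reward[s, a] + gamma * qa[s_next, b_star]
                    qb[s, a] += alpha * (target - qb[s, a])
    return (qa + qb) / 2

# Illustrative usage on a small random MDP (5 states, 3 actions).
rng0 = np.random.default_rng(1)
P = rng0.dirichlet(np.ones(5), size=(5, 3))   # P[s, a] is a distribution over next states
R = rng0.random((5, 3))                       # random reward table
q_estimate = synchronous_double_q(lambda s, a: rng0.choice(5, p=P[s, a]), R,
                                  num_states=5, num_actions=3)

The asynchronous variant analyzed in the paper uses the same update rule but, at each step, updates only the state-action pair visited along a single sample trajectory.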
Pages: 12
References
42 in total
[1] Abed-alguni, B. H. International Journal of Artificial Intelligence, 2018, 16: 41.
[2] Bertsekas, D. P.; Tsitsiklis, J. N. Neuro-Dynamic Programming, 1996.
[3] [Anonymous]. Machine Learning Proceedings 1995, 1995.
[4] Azar, Mohammad Gheshlaghi; Munos, Remi; Kappen, Hilbert J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 2013, 91(3): 325-349.
[5] Beck, C. L.; Srikant, R. Error bounds for constant step-size Q-learning. Systems & Control Letters, 2012, 61(12): 1203-1208.
[6] Bhandari, J. Conference on Learning Theory (COLT), 2018: 1691.
[7] Borkar, V. S.; Meyn, S. P. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 2000, 38(2): 447-469.
[8] Cai, Q. 33rd Conference on Neural Information Processing Systems (NeurIPS), 2019, Vol. 32.
[9] Carvalho, D. Advances in Neural Information Processing Systems, 2020.
[10] Chen, Z., 2021, LYAPUNOV THEORY FINI.