Faster Non-asymptotic Convergence for Double Q-learning

Cited by: 0
Authors
Zhao, Lin [1 ]
Xiong, Huaqing [2 ]
Liang, Yingbin [2 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Ohio State Univ, Columbus, OH 43210 USA
Source
Advances in Neural Information Processing Systems 34 (NeurIPS 2021) | 2021, Vol. 34
Funding
U.S. National Science Foundation
Keywords
Stochastic approximation; Bounds
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Double Q-learning (Hasselt, 2010) has gained significant success in practice due to its effectiveness in overcoming the overestimation issue of Q-learning. However, the theoretical understanding of double Q-learning remains rather limited. The only existing finite-time analysis was recently established in (Xiong et al., 2020), where the polynomial learning rate adopted in the analysis typically yields a slower convergence rate. This paper tackles the more challenging case of a constant learning rate, and develops new analytical tools that improve the existing convergence rate by orders of magnitude. Specifically, we show that synchronous double Q-learning attains an $\epsilon$-accurate global optimum with a time complexity of $\tilde{\Omega}\left(\frac{\ln D}{(1-\gamma)^{7}\epsilon^{2}}\right)$, and the asynchronous algorithm achieves a time complexity of $\tilde{\Omega}\left(\frac{L}{(1-\gamma)^{7}\epsilon^{2}}\right)$, where $D$ is the cardinality of the state-action space, $\gamma$ is the discount factor, and $L$ is a parameter related to the sampling strategy for asynchronous double Q-learning. These results improve the existing convergence rate by orders of magnitude in terms of its dependence on all major parameters ($\epsilon$, $1-\gamma$, $D$, $L$). This paper presents a substantial step toward a full understanding of the fast convergence of double Q-learning.
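As a concrete illustration of the algorithm analyzed in the abstract, the following is a minimal Python sketch of tabular synchronous double Q-learning (Hasselt, 2010) with a constant learning rate, the setting studied in this paper. The generative-model sampler sample_next_state, the reward table reward, and the toy MDP in the usage example are illustrative assumptions and are not taken from the paper.

import numpy as np

def synchronous_double_q(sample_next_state, reward, num_states, num_actions,
                         gamma=0.99, alpha=0.1, num_iters=5000, seed=0):
    """Tabular synchronous double Q-learning with a constant step size alpha.
    sample_next_state(s, a) and reward[s, a] are a hypothetical generative-model
    interface assumed for this sketch."""
    rng = np.random.default_rng(seed)
    qa = np.zeros((num_states, num_actions))
    qb = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        # Synchronous setting: every (s, a) pair is updated with a fresh sample.
        for s in range(num_states):
            for a in range(num_actions):
                s_next = sample_next_state(s, a)
                if rng.random() < 0.5:
                    # Update Q^A: next action chosen by Q^A, evaluated by Q^B.
                    a_star = int(np.argmax(qa[s_next]))
                    target = reward[s, a] + gamma * qb[s_next, a_star]
                    qa[s, a] += alpha * (target - qa[s, a])
                else:
                    # Update Q^B: next action chosen by Q^B, evaluated by Q^A.
                    b_star = int(np.argmax(qb[s_next]))
                    target = reward[s, a] + gamma * qa[s_next, b_star]
                    qb[s, a] += alpha * (target - qb[s, a])
    return (qa + qb) / 2

# Illustrative usage on a small random MDP (5 states, 3 actions).
rng0 = np.random.default_rng(1)
P = rng0.dirichlet(np.ones(5), size=(5, 3))   # P[s, a] is a distribution over next states
R = rng0.random((5, 3))                       # random reward table
q_estimate = synchronous_double_q(lambda s, a: rng0.choice(5, p=P[s, a]), R,
                                  num_states=5, num_actions=3)

The asynchronous variant analyzed in the paper uses the same update rule but, at each step, updates only the state-action pair visited along a single sample trajectory.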
Pages: 12
References
42 in total
[1] Abed-alguni, B. H. International Journal of Artificial Intelligence, 2018, 16: 41.
[2] Bertsekas, D. P.; Tsitsiklis, J. N. Neuro-Dynamic Programming, 1996.
[3] [Anonymous]. Machine Learning Proceedings 1995, 1995.
[4] Azar, Mohammad Gheshlaghi; Munos, Remi; Kappen, Hilbert J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 2013, 91(3): 325-349.
[5] Beck, C. L.; Srikant, R. Error bounds for constant step-size Q-learning. Systems & Control Letters, 2012, 61(12): 1203-1208.
[6] Bhandari, J. Conference on Learning Theory (COLT), 2018: 1691.
[7] Borkar, V. S.; Meyn, S. P. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 2000, 38(2): 447-469.
[8] Cai, Q. 33rd Conference on Neural Information Processing Systems (NeurIPS), 2019, Vol. 32.
[9] Carvalho, D. Advances in Neural Information Processing Systems, 2020.
[10] Chen, Z., 2021, LYAPUNOV THEORY FINI.