A DISCRETE-TIME SWITCHING SYSTEM ANALYSIS OF Q-LEARNING

Cited by: 4
Authors
Lee, Donghwan [1 ]
Hu, Jianghai [2 ]
He, Niao [3 ]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Dept Elect Engn, Daejeon 34141, South Korea
[2] Purdue Univ, Dept Elect & Comp Engn, W Lafayette, IN 47906 USA
[3] Swiss Fed Inst Technol, Dept Comp Sci, CH-8092 Zurich, Switzerland
Funding
Swiss National Science Foundation; US National Science Foundation; National Research Foundation of Singapore;
Keywords
Q-learning; switched linear system; stochastic approximation; convergence; rates;
DOI
10.1137/22M1489976
CLC classification number
TP [automation and computer technology];
Discipline classification code
0812;
Abstract
This paper develops a novel control-theoretic framework to analyze the nonasymptotic convergence of Q-learning. We show that the dynamics of asynchronous Q-learning with a constant step size can be naturally formulated as a discrete-time stochastic affine switching system. In particular, for a given Q-function parameter Q, the greedy policy \pi_Q(s) := \arg\max_a Q(s, a) in the Q-learning update plays the role of the switching policy and is the key connection between the switching system and Q-learning. The evolution of the Q-learning estimation error is then bounded above and below by the trajectories of two simpler dynamical systems. Based on these two systems, we derive a new finite-time error bound for asynchronous Q-learning with a constant step size. In addition, the new analysis sheds light on the overestimation phenomenon of Q-learning.
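The update the abstract analyzes is the standard asynchronous Q-learning recursion Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha (r_k + \gamma \max_a Q_k(s_{k+1}, a) - Q_k(s_k, a_k)), applied to a single (state, action) entry per step with a constant step size \alpha. As a rough illustration (not the authors' code), the Python sketch below runs this recursion on a small randomly generated MDP; the MDP sizes, reward table, uniform exploration policy, and step size are assumptions made for the example. The argmax over actions inside the loop is the greedy policy \pi_Q(s) = \arg\max_a Q(s, a) that the paper treats as the switching signal of the affine switching system.

import numpy as np

# Minimal sketch, not the paper's implementation: asynchronous Q-learning with
# a constant step size on a small random MDP. Sizes, rewards, the uniform
# exploration policy, and alpha are illustrative assumptions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 3, 0.9, 0.05

# Random transition kernel P[s, a, s'] and reward table R[s, a].
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(50_000):
    a = rng.integers(n_actions)               # behavior policy: uniform exploration
    s_next = rng.choice(n_states, p=P[s, a])  # sample next state from the MDP
    # The max below is taken at the greedy (switching) policy pi_Q(s') = argmax_a Q(s', a);
    # which affine subsystem is active at step k is determined by this argmax.
    td_target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])  # asynchronous update of a single entry
    s = s_next

print(Q)  # with constant alpha, Q settles in a neighborhood of the optimum

With a constant step size the iterates settle into a neighborhood of the optimal Q-function rather than converging exactly; that is the regime covered by the paper's finite-time error bound.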
Pages: 1861-1880
Number of pages: 20