A DISCRETE-TIME SWITCHING SYSTEM ANALYSIS OF Q-LEARNING

Cited by: 4
Authors
Lee, Donghwan [1 ]
Hu, Jianghai [2 ]
He, Niao [3 ]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Dept Elect Engn, Daejeon 34141, South Korea
[2] Purdue Univ, Dept Elect & Comp Engn, W Lafayette, IN 47906 USA
[3] Swiss Fed Inst Technol, Dept Comp Sci, CH-8092 Zurich, Switzerland
Funding
Swiss National Science Foundation; US National Science Foundation; National Research Foundation of Singapore;
Keywords
Q-learning; switched linear system; stochastic approximation; convergence; rates;
DOI
10.1137/22M1489976
CLC classification number
TP [automation and computer technology];
Discipline classification code
0812;
Abstract
This paper develops a novel control-theoretic framework to analyze the nonasymptotic convergence of Q-learning. We show that the dynamics of asynchronous Q-learning with a constant step size can be naturally formulated as a discrete-time stochastic affine switching system. In particular, for a given Q-function parameter Q, the greedy policy \pi_Q(s) := \arg\max_a Q(s, a) in the Q-learning update plays the role of the switching policy and is the key connection between the switching system and Q-learning. The evolution of the Q-learning estimation error is then bounded above and below by the trajectories of two simpler dynamical systems. Based on these two systems, we derive a new finite-time error bound for asynchronous Q-learning with a constant step size. In addition, the new analysis sheds light on the overestimation phenomenon of Q-learning.
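The update the abstract analyzes is the standard asynchronous Q-learning recursion Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha (r_k + \gamma \max_a Q_k(s_{k+1}, a) - Q_k(s_k, a_k)), applied to a single (state, action) entry per step with a constant step size \alpha. As a rough illustration (not the authors' code), the Python sketch below runs this recursion on a small randomly generated MDP; the MDP sizes, reward table, uniform exploration policy, and step size are assumptions made for the example. The argmax over actions inside the loop is the greedy policy \pi_Q(s) = \arg\max_a Q(s, a) that the paper treats as the switching signal of the affine switching system.

import numpy as np

# Minimal sketch, not the paper's implementation: asynchronous Q-learning with
# a constant step size on a small random MDP. Sizes, rewards, the uniform
# exploration policy, and alpha are illustrative assumptions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 3, 0.9, 0.05

# Random transition kernel P[s, a, s'] and reward table R[s, a].
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(50_000):
    a = rng.integers(n_actions)               # behavior policy: uniform exploration
    s_next = rng.choice(n_states, p=P[s, a])  # sample next state from the MDP
    # The max below is taken at the greedy (switching) policy pi_Q(s') = argmax_a Q(s', a);
    # which affine subsystem is active at step k is determined by this argmax.
    td_target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])  # asynchronous update of a single entry
    s = s_next

print(Q)  # with constant alpha, Q settles in a neighborhood of the optimum

With a constant step size the iterates settle into a neighborhood of the optimal Q-function rather than converging exactly; that is the regime covered by the paper's finite-time error bound.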
Pages: 1861-1880
Number of pages: 20