Target Network and Truncation Overcome the Deadly Triad in Q-Learning

Cited by: 2
Authors
Chen, Zaiwei [1]
Clarke, John-Paul [2 ]
Maguluri, Siva Theja [3 ]
Affiliations
[1] CALTECH, Comp Math Sci, Pasadena, CA 91106 USA
[2] Univ Texas Austin, Aerosp Engn & Engn Mech, Austin, TX 78712 USA
[3] Georgia Inst Technol, Ind Syst Engn, Atlanta, GA 30332 USA
Source
SIAM Journal on Mathematics of Data Science | 2023, Vol. 5, No. 4
Keywords
reinforcement learning; Q-learning; linear function approximation; finite-sample analysis; stochastic approximation
DOI
10.1137/22M1499261
CLC Classification Code
O29 [Applied Mathematics]
Discipline Code
070104
Abstract
Q-learning with function approximation is one of the most empirically successful yet theoretically mysterious reinforcement learning (RL) algorithms, and was identified in [R. S. Sutton, in European Conference on Computational Learning Theory, Springer, New York, 1999, pp. 11-17] as one of the most important theoretical open problems in the RL community. Even in the basic setting where linear function approximation is used, there are well-known divergent examples. In this work, we propose a stable online variant of Q-learning with linear function approximation that uses a target network and truncation and is driven by a single trajectory of Markovian samples. We present finite-sample guarantees for the algorithm, which imply a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ up to a function approximation error. Importantly, we establish the results under minimal assumptions and do not modify the problem parameters to achieve stability.
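To make the ingredients named in the abstract concrete, here is a minimal Python sketch of online Q-learning with linear function approximation augmented with a periodically synchronized target network and truncation, the latter rendered here as projection of the weight vector onto a Euclidean ball. This is an assumption-laden illustration, not the paper's exact algorithm: the Gym-like `env.reset()`/`env.step()` interface, the feature map `phi`, the uniform behavior policy, all constants, and the exact placement of the truncation step are hypothetical choices of this sketch.

```python
import numpy as np

def q_learning_target_truncation(env, phi, num_actions, dim,
                                 gamma=0.99, alpha=0.05, radius=10.0,
                                 sync_period=100, num_steps=10_000):
    """Illustrative sketch: linear Q-learning with a target network and
    truncation (projection onto a ball of radius `radius`). The `env` and
    `phi` interfaces and all constants are assumptions, not from the paper."""
    theta = np.zeros(dim)        # online weight vector
    theta_target = theta.copy()  # target-network weights, held fixed between syncs
    s = env.reset()              # hypothetical Gym-like environment interface
    for t in range(num_steps):
        a = np.random.randint(num_actions)  # uniform behavior policy (assumption)
        s_next, r = env.step(a)             # one step of a single Markovian trajectory
        # Bootstrap target is computed with the *target* weights, not the online ones.
        q_next = max(phi(s_next, b) @ theta_target for b in range(num_actions))
        td_error = r + gamma * q_next - phi(s, a) @ theta
        theta = theta + alpha * td_error * phi(s, a)
        # Truncation: project the iterate back onto the Euclidean ball of radius R.
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
        if (t + 1) % sync_period == 0:
            theta_target = theta.copy()     # periodic target-network synchronization
        s = s_next
    return theta
```

Syncing the target weights only every `sync_period` steps decouples the bootstrap target from the fast-moving online iterate, while the projection keeps the iterates bounded, which is the stabilizing role the abstract assigns to truncation.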
Pages: 1078-1101 (24 pages)