Target Network and Truncation Overcome the Deadly Triad in Q-Learning

Cited by: 2
Authors
Chen, Zaiwei [1]
Clarke, John-Paul [2 ]
Maguluri, Siva Theja [3 ]
Affiliations
[1] CALTECH, Comp Math Sci, Pasadena, CA 91106 USA
[2] Univ Texas Austin, Aerosp Engn & Engn Mech, Austin, TX 78712 USA
[3] Georgia Inst Technol, Ind Syst Engn, Atlanta, GA 30332 USA
Source
SIAM Journal on Mathematics of Data Science | 2023, Vol. 5, No. 4
Keywords
reinforcement learning; Q-learning; linear function approximation; finite-sample analysis; stochastic approximation
DOI
10.1137/22M1499261
CLC Classification Code
O29 [Applied Mathematics]
Discipline Code
070104
Abstract
Q-learning with function approximation is one of the most empirically successful yet theoretically mysterious reinforcement learning (RL) algorithms, and was identified in [R. S. Sutton, in European Conference on Computational Learning Theory, Springer, New York, 1999, pp. 11-17] as one of the most important theoretical open problems in the RL community. Even in the basic setting where linear function approximation is used, there are well-known divergent examples. In this work, we propose a stable online variant of Q-learning with linear function approximation that uses a target network and truncation and is driven by a single trajectory of Markovian samples. We present finite-sample guarantees for the algorithm, which imply a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ up to a function approximation error. Importantly, we establish the results under minimal assumptions and do not modify the problem parameters to achieve stability.
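To make the ingredients named in the abstract concrete, here is a minimal Python sketch of online Q-learning with linear function approximation augmented with a periodically synchronized target network and truncation, the latter rendered here as projection of the weight vector onto a Euclidean ball. This is an assumption-laden illustration, not the paper's exact algorithm: the Gym-like `env.reset()`/`env.step()` interface, the feature map `phi`, the uniform behavior policy, all constants, and the exact placement of the truncation step are hypothetical choices of this sketch.

```python
import numpy as np

def q_learning_target_truncation(env, phi, num_actions, dim,
                                 gamma=0.99, alpha=0.05, radius=10.0,
                                 sync_period=100, num_steps=10_000):
    """Illustrative sketch: linear Q-learning with a target network and
    truncation (projection onto a ball of radius `radius`). The `env` and
    `phi` interfaces and all constants are assumptions, not from the paper."""
    theta = np.zeros(dim)        # online weight vector
    theta_target = theta.copy()  # target-network weights, held fixed between syncs
    s = env.reset()              # hypothetical Gym-like environment interface
    for t in range(num_steps):
        a = np.random.randint(num_actions)  # uniform behavior policy (assumption)
        s_next, r = env.step(a)             # one step of a single Markovian trajectory
        # Bootstrap target is computed with the *target* weights, not the online ones.
        q_next = max(phi(s_next, b) @ theta_target for b in range(num_actions))
        td_error = r + gamma * q_next - phi(s, a) @ theta
        theta = theta + alpha * td_error * phi(s, a)
        # Truncation: project the iterate back onto the Euclidean ball of radius R.
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
        if (t + 1) % sync_period == 0:
            theta_target = theta.copy()     # periodic target-network synchronization
        s = s_next
    return theta
```

Syncing the target weights only every `sync_period` steps decouples the bootstrap target from the fast-moving online iterate, while the projection keeps the iterates bounded, which is the stabilizing role the abstract assigns to truncation.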
Pages: 1078-1101 (24 pages)