Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning

Cited by: 0
Authors
Li, Gen [1 ]
Cai, Changxiao [2 ]
Chen, Yuxin [2 ]
Gu, Yuantao [1 ]
Wei, Yuting [3 ]
Chi, Yuejie [4 ]
Affiliations
[1] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
[2] Princeton Univ, Dept Elect & Comp Engn, Princeton, NJ 08544 USA
[3] Carnegie Mellon Univ, Dept Stat & Data Sci, Pittsburgh, PA 15213 USA
[4] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139 | 2021 / Vol. 139
Keywords
STOCHASTIC-APPROXIMATION; CONVERGENCE; BOUNDS; RATES
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. Focusing on the synchronous setting (where independent samples for all state-action pairs are queried via a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. To yield an entrywise ε-accurate estimate of the optimal Q-function, state-of-the-art theory requires at least an order of |S||A|/((1-γ)^5 ε^2) samples in the infinite-horizon γ-discounted setting. In this work, we sharpen the sample complexity of synchronous Q-learning to the order of |S||A|/((1-γ)^4 ε^2) (up to some logarithmic factor) for any 0 < ε < 1, leading to an order-wise improvement in 1/(1-γ). Analogous results are derived for finite-horizon MDPs as well. Our sample complexity analysis unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage. Our result is obtained by identifying novel error decompositions and recursions, which might shed light on how to study other variants of Q-learning.
Pages: 11
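The abstract analyzes vanilla synchronous Q-learning: at every iteration, one next-state sample is drawn from a generative model for each (state, action) pair, and every entry of the Q-table is updated at once. A minimal sketch of that update rule, on a toy two-state MDP invented here for illustration (the MDP, the constant step size `eta`, and all names are assumptions, not from the paper):

```python
import numpy as np

def synchronous_q_learning(P, R, gamma=0.9, eta=0.1, num_iters=5000, seed=0):
    """Vanilla synchronous Q-learning with a generative model.

    P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A).
    Each iteration draws one next-state sample for EVERY (s, a) pair and
    applies the update Q(s,a) <- (1-eta) Q(s,a) + eta (r + gamma max_a' Q(s',a')).
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[s, a])  # generative-model sample
                target = R[s, a] + gamma * Q[s_next].max()
                Q[s, a] = (1 - eta) * Q[s, a] + eta * target
    return Q

# Hypothetical 2-state, 2-action MDP used only to exercise the sketch.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
Q_hat = synchronous_q_learning(P, R)
```

Since rewards lie in [0, 1], every entry of the iterate stays within [0, 1/(1-γ)] = [0, 10]; the paper's result concerns how many such samples suffice for an entrywise ε-accurate estimate.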