Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning

Cited by: 0
Authors
Li, Gen [1 ]
Cai, Changxiao [2 ]
Chen, Yuxin [2 ]
Gu, Yuantao [1 ]
Wei, Yuting [3 ]
Chi, Yuejie [4 ]
Affiliations
[1] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
[2] Princeton Univ, Dept Elect & Comp Engn, Princeton, NJ 08544 USA
[3] Carnegie Mellon Univ, Dept Stat & Data Sci, Pittsburgh, PA 15213 USA
[4] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139 | 2021 / Vol. 139
Keywords
STOCHASTIC-APPROXIMATION; CONVERGENCE; BOUNDS; RATES
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. Focusing on the synchronous setting (where independent samples for all state-action pairs are queried via a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. To yield an entrywise ε-accurate estimate of the optimal Q-function, state-of-the-art theory requires at least an order of |S||A|/((1-γ)^5 ε^2) samples in the infinite-horizon γ-discounted setting. In this work, we sharpen the sample complexity of synchronous Q-learning to the order of |S||A|/((1-γ)^4 ε^2) (up to some logarithmic factor) for any 0 < ε < 1, leading to an order-wise improvement in 1/(1-γ). Analogous results are derived for finite-horizon MDPs as well. Our sample complexity analysis unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage. Our result is obtained by identifying novel error decompositions and recursions, which might shed light on how to study other variants of Q-learning.
Pages: 11
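The abstract analyzes vanilla synchronous Q-learning: at every iteration, one next-state sample is drawn from a generative model for each (state, action) pair, and every entry of the Q-table is updated at once. A minimal sketch of that update rule, on a toy two-state MDP invented here for illustration (the MDP, the constant step size `eta`, and all names are assumptions, not from the paper):

```python
import numpy as np

def synchronous_q_learning(P, R, gamma=0.9, eta=0.1, num_iters=5000, seed=0):
    """Vanilla synchronous Q-learning with a generative model.

    P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A).
    Each iteration draws one next-state sample for EVERY (s, a) pair and
    applies the update Q(s,a) <- (1-eta) Q(s,a) + eta (r + gamma max_a' Q(s',a')).
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[s, a])  # generative-model sample
                target = R[s, a] + gamma * Q[s_next].max()
                Q[s, a] = (1 - eta) * Q[s, a] + eta * target
    return Q

# Hypothetical 2-state, 2-action MDP used only to exercise the sketch.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
Q_hat = synchronous_q_learning(P, R)
```

Since rewards lie in [0, 1], every entry of the iterate stays within [0, 1/(1-γ)] = [0, 10]; the paper's result concerns how many such samples suffice for an entrywise ε-accurate estimate.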