Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

Times Cited: 0
Authors
Wang, Bingyan [1]
Yan, Yuling [1]
Fan, Jianqing [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021 / Vol. 34
Keywords
BOUNDS; RATES; GAME; GO
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, obtaining a near-optimal policy with sampling access to a generative model requires a minimax-optimal sample complexity that scales linearly with |S| × |A|, which can be prohibitively large when S or A is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1 − γ)^3 ε^2) (resp. K/((1 − γ)^4 ε^2)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. These results apply to tabular MDPs by taking the coordinate basis, in which case K = |S| × |A|. Both sample complexity bounds are provably tight, and the bound for the model-based approach matches the minimax lower bound. These results show that for an arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.
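To make the abstract's setting concrete, the sketch below builds a small synthetic MDP whose transition kernel is linear in K state-action features, P(s'|s, a) = φ(s, a)ᵀψ(s'), draws next-state samples from a generative model, and runs plug-in value iteration on the empirical kernel. This is only an illustration of the linearly-parameterized setting, not the authors' algorithm (which exploits the K-dimensional structure rather than sampling every (s, a) pair); all names and constants (phi, psi, n, etc.) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A, K = 20, 4, 5     # small tabular instance for illustration
gamma = 0.9            # discount factor

# Features phi(s, a) in R^K and factors psi(k, .) over next states;
# taking the coordinate basis with K = |S| x |A| recovers the tabular
# setting discussed in the abstract.
phi = rng.dirichlet(np.ones(K), size=(S, A))   # (S, A, K), rows sum to 1
psi = rng.dirichlet(np.ones(S), size=K)        # (K, S), rows sum to 1
P = phi @ psi                                  # true kernel, (S, A, S)
r = rng.uniform(size=(S, A))                   # rewards in [0, 1]

# Generative model: draw n i.i.d. next states per (s, a) and form the
# empirical kernel. The paper's point is that when K is small, on the
# order of K/((1 - gamma)^3 eps^2) total samples suffice, far fewer
# than the |S||A|-scaling required without the linear structure.
n = 200
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        draws = rng.choice(S, size=n, p=P[s, a])
        P_hat[s, a] = np.bincount(draws, minlength=S) / n

def value_iteration(P, r, gamma, iters=500):
    """Plain value iteration on a (possibly estimated) kernel."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return V

V_star = value_iteration(P, r, gamma)
V_plug = value_iteration(P_hat, r, gamma)
print("sup-norm error of plug-in values:", np.abs(V_star - V_plug).max())
```

The sketch samples all |S| × |A| pairs, which is exactly what the linear parameterization lets the paper avoid; it is included only to show the generative-model access pattern and the plug-in, model-based idea whose sample complexity the abstract quantifies.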
Pages: 14