Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

Times Cited: 0
Authors
Wang, Bingyan [1]
Yan, Yuling [1]
Fan, Jianqing [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021 / Vol. 34
Keywords
BOUNDS; RATES; GAME; GO
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, obtaining a near-optimal policy with sampling access to a generative model requires a minimax-optimal sample complexity that scales linearly with |S| × |A|, which can be prohibitively large when S or A is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1 − γ)^3 ε^2) (resp. K/((1 − γ)^4 ε^2)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. These results apply to tabular MDPs by taking the coordinate basis, in which case K = |S| × |A|. Both sample complexity bounds are provably tight, and the bound for the model-based approach matches the minimax lower bound. These results show that for an arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.
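To make the abstract's setting concrete, the sketch below builds a small synthetic MDP whose transition kernel is linear in K state-action features, P(s'|s, a) = φ(s, a)ᵀψ(s'), draws next-state samples from a generative model, and runs plug-in value iteration on the empirical kernel. This is only an illustration of the linearly-parameterized setting, not the authors' algorithm (which exploits the K-dimensional structure rather than sampling every (s, a) pair); all names and constants (phi, psi, n, etc.) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A, K = 20, 4, 5     # small tabular instance for illustration
gamma = 0.9            # discount factor

# Features phi(s, a) in R^K and factors psi(k, .) over next states;
# taking the coordinate basis with K = |S| x |A| recovers the tabular
# setting discussed in the abstract.
phi = rng.dirichlet(np.ones(K), size=(S, A))   # (S, A, K), rows sum to 1
psi = rng.dirichlet(np.ones(S), size=K)        # (K, S), rows sum to 1
P = phi @ psi                                  # true kernel, (S, A, S)
r = rng.uniform(size=(S, A))                   # rewards in [0, 1]

# Generative model: draw n i.i.d. next states per (s, a) and form the
# empirical kernel. The paper's point is that when K is small, on the
# order of K/((1 - gamma)^3 eps^2) total samples suffice, far fewer
# than the |S||A|-scaling required without the linear structure.
n = 200
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        draws = rng.choice(S, size=n, p=P[s, a])
        P_hat[s, a] = np.bincount(draws, minlength=S) / n

def value_iteration(P, r, gamma, iters=500):
    """Plain value iteration on a (possibly estimated) kernel."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return V

V_star = value_iteration(P, r, gamma)
V_plug = value_iteration(P_hat, r, gamma)
print("sup-norm error of plug-in values:", np.abs(V_star - V_plug).max())
```

The sketch samples all |S| × |A| pairs, which is exactly what the linear parameterization lets the paper avoid; it is included only to show the generative-model access pattern and the plug-in, model-based idea whose sample complexity the abstract quantifies.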
Pages: 14