Q-learning with Nearest Neighbors

Cited by: 0
Authors
Shah, Devavrat [1 ,2 ]
Xie, Qiaomin [1 ]
Affiliations
[1] MIT, LIDS, Cambridge, MA 02139 USA
[2] MIT, Stat & Data Sci Ctr, Dept EECS, Cambridge, MA 02139 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018) | 2018 / Vol. 31
Keywords
MARKOV DECISION-PROCESSES; CONVERGENCE; APPROXIMATION; TREES;
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path of the system, generated under an arbitrary policy, is available. We study the Nearest Neighbor Q-Learning (NNQL) algorithm, which learns the optimal Q-function via nearest neighbor regression. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and discount factor γ ∈ (0, 1), given an arbitrary sample path with "covering time" L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L/(ε^3 (1 − γ)^7)) samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as Õ(1/ε^d), so the sample complexity scales as Õ(1/ε^(d+3)). Indeed, we establish a lower bound showing that a dependence of Ω̃(1/ε^(d+2)) is necessary.
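The abstract's core idea can be illustrated with a minimal sketch (not the paper's exact algorithm, and with hypothetical names such as `nnql_update`): Q-values are stored on a fixed set of anchor states, a continuous state is mapped to its nearest anchor, and that anchor's entry receives a standard tabular Q-learning update along the single sample path.

```python
import numpy as np

# Illustrative sketch only, assuming a Euclidean nearest-neighbor rule and a
# fixed grid of anchor states; the paper's NNQL uses a more careful
# nearest-neighbor regression and step-size schedule for its guarantees.

def nearest_anchor(state, anchors):
    """Index of the anchor state closest (Euclidean norm) to `state`."""
    return int(np.argmin(np.linalg.norm(anchors - state, axis=1)))

def nnql_update(Q, anchors, s, a, r, s_next, gamma=0.9, alpha=0.5):
    """One Q-learning step applied at the nearest anchor of s."""
    i = nearest_anchor(s, anchors)          # anchor representing state s
    j = nearest_anchor(s_next, anchors)     # anchor representing s'
    target = r + gamma * Q[j].max()         # bootstrapped Bellman target
    Q[i, a] += alpha * (target - Q[i, a])   # tabular update at anchor i
    return Q

# Toy usage: 1-D state space [0, 1], 5 anchors, 2 actions.
anchors = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
Q = np.zeros((5, 2))
Q = nnql_update(Q, anchors, s=np.array([0.52]), a=1, r=1.0,
                s_next=np.array([0.9]))
```

As the state space dimension d grows, covering it with anchors of radius ε requires on the order of 1/ε^d anchors, which is the source of the ε^d factor in the covering time and sample complexity bounds above.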
Pages: 11