Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Cited: 0
Authors
Zhang, Zihan [1 ]
Jiang, Yuhang [1 ]
Zhou, Yuan [2 ,3 ]
Ji, Xiangyang [1 ]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[2] Tsinghua Univ, Yau Math Sci Ctr, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Math Sci, Beijing, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022 | 2022
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent must commit to a schedule of policy-update times before learning begins, which makes it particularly suitable for scenarios where adaptively changing the policy is costly. Given a finite-horizon MDP with S states, A actions, and planning horizon H, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3 K \ln(1/\delta)})$ over K episodes using $O(H + \log_2 \log_2 K)$ batches, where $\delta$ is the confidence parameter. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3 K})$ regret bound achieved with $O(H + \log_2 \log_2 K)$ batch complexity. Meanwhile, we show that achieving $\tilde{O}(\mathrm{poly}(S, A, H)\sqrt{K})$ regret requires at least $\Omega(H/\log_A K + \log_2 \log_2 K)$ batches, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximate transition model.
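To make the batch-complexity claim concrete, the following minimal Python sketch builds a pre-specified batch schedule of the kind the abstract describes: H warm-up batches followed by a doubling-style grid, giving O(H + log_2 log_2 K) batches in total. The function name batch_schedule, the warm-up lengths, and the grid endpoints K^(1 - 2^{-i}) are illustrative assumptions, not the paper's exact construction.

import math

def batch_schedule(K: int, H: int) -> list[int]:
    """Return a pre-specified list of batch endpoints (episode indices).

    Hypothetical construction for illustration only: H equal warm-up
    batches, then a doubling-style grid with endpoints ~ K**(1 - 2**-i)
    for i = 1 .. ceil(log2 log2 K), and a final batch ending at K.
    Total number of batches: O(H + log2 log2 K).
    """
    endpoints: list[int] = []

    # Stage 1: H warm-up batches of equal length (hypothetical choice).
    warmup = max(1, K // (10 * H))
    for i in range(1, H + 1):
        endpoints.append(min(i * warmup, K))

    # Stage 2: doubling grid; after ~log2 log2 K steps the grid reaches ~K/2.
    num_grid = max(1, math.ceil(math.log2(math.log2(K))))
    for i in range(1, num_grid + 1):
        t = math.ceil(K ** (1.0 - 2.0 ** (-i)))
        if endpoints[-1] < t < K:
            endpoints.append(t)

    # Final batch covers the remaining episodes.
    if endpoints[-1] < K:
        endpoints.append(K)
    return endpoints

if __name__ == "__main__":
    K, H = 10**6, 10
    sched = batch_schedule(K, H)
    print(f"{len(sched)} batches for K={K}, H={H}: {sched}")

For example, with K = 10^6 and H = 10 this schedule uses roughly 14 batches: the H warm-up batches plus about log_2 log_2 K grid points and a final batch.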
Pages: 11
Related Papers
50 records in total
  • [41] Optimal Regret Bounds for Collaborative Learning in Bandits
    Shidani, Amitis
    Vakili, Sattar
    INTERNATIONAL CONFERENCE ON ALGORITHMIC LEARNING THEORY, VOL 237, 2024, 237
  • [42] A Biased Graph Neural Network Sampler with Near-Optimal Regret
    Zhang, Qingru
    Wipf, David
    Gan, Quan
    Song, Le
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [44] A Bayesian reinforcement learning approach in markov games for computing near-optimal policies
    Clempner, Julio B.
    ANNALS OF MATHEMATICS AND ARTIFICIAL INTELLIGENCE, 2023, 91 (05) : 675 - 690
  • [45] Near-optimal reinforcement learning framework for energy-aware sensor communications
    Pandana, C
    Liu, KJR
    IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2005, 23 (04) : 788 - 797
  • [46] Near-Optimal Vehicular Crowdsensing Task Allocation Empowered by Deep Reinforcement Learning
    Xiang C.-C.
    Li Y.-Y.
    Feng L.
    Chen C.
    Guo S.-T.
    Yang P.-L.
    Jisuanji Xuebao/Chinese Journal of Computers, 2022, 45 (05) : 918 - 934
  • [47] Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning
    Yin, Ming
    Bai, Yu
    Wang, Yu-Xiang
    24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130
  • [48] Near-Optimal Sample Complexity Bounds for Constrained MDPs
    Vaswani, Sharan
    Yang, Lin F.
    Szepesvari, Csaba
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022
  • [49] Near-Optimal Complexity Bounds for Fragments of the Skolem Problem
    Akshay, S.
    Balaji, Nikhil
    Murhekar, Aniket
    Varma, Rohith
    Vyas, Nikhil
    37TH INTERNATIONAL SYMPOSIUM ON THEORETICAL ASPECTS OF COMPUTER SCIENCE (STACS 2020), 2020, 154
  • [50] Near-optimal lower bounds on the multi-party communication complexity of set disjointness
    Chakrabarti, A
    Khot, S
    Sun, XD
    18TH IEEE ANNUAL CONFERENCE ON COMPUTATIONAL COMPLEXITY, PROCEEDINGS, 2003 : 107 - 117