Learning Infinite-Horizon Average-Reward Restless Multi-Action Bandits via Index Awareness

被引：0

作者：

Xiong, Guojun ^{[1
]}

Wang, Shufan ^{[1
]}

Li, Jian ^{[1
]}

机构：

[1] SUNY Binghamton, Binghamton, NY 13902 USA

来源：

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022年

基金：

美国国家科学基金会;

关键词：

RESOURCE-ALLOCATION; MULTIARMED BANDIT; HEURISTICS; OPTIMALITY;

D O I：

暂无

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

We consider the online restless bandits with average-reward and multiple actions, where the state of each arm evolves according to a Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. Since finding the optimal control is typically intractable for restless bandits, existing learning algorithms are often computationally expensive or with a regret bound that is exponential in the number of arms and states. In this paper, we advocate index-aware reinforcement learning (RL) solutions to design RL algorithms operating on a much smaller dimensional subspace by exploiting the inherent structure in restless bandits. Specifically, we first propose novel index policies to address dimensionality concerns, which are provably optimal. We then leverage the indices to develop two low-complexity index-aware RL algorithms, namely, (i) GM-R2MAB, which has access to a generative model; and (ii) UC-R2MAB, which learns the model using an upper confidence style online exploitation method. We prove that both algorithms achieve a sublinear regret that is only polynomial in the number of arms and states. A key differentiator between our algorithms and existing ones stems from the fact that our RL algorithms contain a novel exploitation that leverages our proposed provably optimal index policies for decision-makings.

引用

页数：15

共 10 条

[1] Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Wei, Chen-Yu
Jafarnia-Jahromi, Mehdi
Luo, Haipeng
Jain, Rahul
24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130
[2] Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints
Chen, Liyu
Jain, Rahul
Luo, Haipeng
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
[3] Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Wu, Yue
Zhou, Dongruo
Gu, Quanquan
INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
[4] ON THE ASYMPTOTIC OPTIMALITY OF GREEDY INDEX HEURISTICS FOR MULTI-ACTION RESTLESS BANDITS
Hodge, D. J.
Glazebrook, K. D.
ADVANCES IN APPLIED PROBABILITY, 2015, 47 (03) : 652 - 667
[5] Q-Learning Lagrange Policies for Multi-Action Restless Bandits
Killian, Jackson A.
Biswas, Arpita
Shah, Sanket
Tambe, Milind
KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 871 - 881
[6] Whittle index based Q-learning for restless bandits with average reward
Avrachenkov, Konstantin E.
Borkar, Vivek S.
AUTOMATICA, 2022, 139
[7] A Provably-Efficient Model-Free Algorithm for Infinite-Horizon Average-Reward Constrained Markov Decision Processes
Wei, Honghao
Liu, Xin
Ying, Lei
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3868 - 3876
[8] AN ASYMPTOTICALLY OPTIMAL HEURISTIC FOR GENERAL NONSTATIONARY FINITE-HORIZON RESTLESS MULTI-ARMED, MULTI-ACTION BANDITS
Zayas-Caban, Gabriel
Jasin, Stefanus
Wang, Guihua
ADVANCES IN APPLIED PROBABILITY, 2019, 51 (03) : 745 - 772
[9] Convergence Rates of Average-Reward Multi-agent Reinforcement Learning via Randomized Linear Programming
Koppel, Alec
Bedi, Amrit Singh
Ganguly, Bhargav
Aggarwal, Vaneet
2022 IEEE 61ST CONFERENCE ON DECISION AND CONTROL (CDC), 2022, : 4545 - 4552
[10] An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits (vol 51, pg 745, 2019)
Zayas-Caban, Gabriel
Liang, Jiaxin
Jasin, Stefanus
Wang, Guihua
ADVANCES IN APPLIED PROBABILITY, 2023, 55 (03) : 1075 - 1083

← 1 →