Learning Infinite-Horizon Average-Reward Restless Multi-Action Bandits via Index Awareness

被引:0
|
作者
Xiong, Guojun [1 ]
Wang, Shufan [1 ]
Li, Jian [1 ]
机构
[1] SUNY Binghamton, Binghamton, NY 13902 USA
基金
美国国家科学基金会;
关键词
RESOURCE-ALLOCATION; MULTIARMED BANDIT; HEURISTICS; OPTIMALITY;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We consider the online restless bandits with average-reward and multiple actions, where the state of each arm evolves according to a Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. Since finding the optimal control is typically intractable for restless bandits, existing learning algorithms are often computationally expensive or with a regret bound that is exponential in the number of arms and states. In this paper, we advocate index-aware reinforcement learning (RL) solutions to design RL algorithms operating on a much smaller dimensional subspace by exploiting the inherent structure in restless bandits. Specifically, we first propose novel index policies to address dimensionality concerns, which are provably optimal. We then leverage the indices to develop two low-complexity index-aware RL algorithms, namely, (i) GM-R2MAB, which has access to a generative model; and (ii) UC-R2MAB, which learns the model using an upper confidence style online exploitation method. We prove that both algorithms achieve a sublinear regret that is only polynomial in the number of arms and states. A key differentiator between our algorithms and existing ones stems from the fact that our RL algorithms contain a novel exploitation that leverages our proposed provably optimal index policies for decision-makings.
引用
收藏
页数:15
相关论文
共 10 条
  • [1] Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
    Wei, Chen-Yu
    Jafarnia-Jahromi, Mehdi
    Luo, Haipeng
    Jain, Rahul
    24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130
  • [2] Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints
    Chen, Liyu
    Jain, Rahul
    Luo, Haipeng
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [3] Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
    Wu, Yue
    Zhou, Dongruo
    Gu, Quanquan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
  • [4] ON THE ASYMPTOTIC OPTIMALITY OF GREEDY INDEX HEURISTICS FOR MULTI-ACTION RESTLESS BANDITS
    Hodge, D. J.
    Glazebrook, K. D.
    ADVANCES IN APPLIED PROBABILITY, 2015, 47 (03) : 652 - 667
  • [5] Q-Learning Lagrange Policies for Multi-Action Restless Bandits
    Killian, Jackson A.
    Biswas, Arpita
    Shah, Sanket
    Tambe, Milind
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 871 - 881
  • [6] Whittle index based Q-learning for restless bandits with average reward
    Avrachenkov, Konstantin E.
    Borkar, Vivek S.
    AUTOMATICA, 2022, 139
  • [7] A Provably-Efficient Model-Free Algorithm for Infinite-Horizon Average-Reward Constrained Markov Decision Processes
    Wei, Honghao
    Liu, Xin
    Ying, Lei
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3868 - 3876
  • [8] AN ASYMPTOTICALLY OPTIMAL HEURISTIC FOR GENERAL NONSTATIONARY FINITE-HORIZON RESTLESS MULTI-ARMED, MULTI-ACTION BANDITS
    Zayas-Caban, Gabriel
    Jasin, Stefanus
    Wang, Guihua
    ADVANCES IN APPLIED PROBABILITY, 2019, 51 (03) : 745 - 772
  • [9] Convergence Rates of Average-Reward Multi-agent Reinforcement Learning via Randomized Linear Programming
    Koppel, Alec
    Bedi, Amrit Singh
    Ganguly, Bhargav
    Aggarwal, Vaneet
    2022 IEEE 61ST CONFERENCE ON DECISION AND CONTROL (CDC), 2022, : 4545 - 4552
  • [10] An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits (vol 51, pg 745, 2019)
    Zayas-Caban, Gabriel
    Liang, Jiaxin
    Jasin, Stefanus
    Wang, Guihua
    ADVANCES IN APPLIED PROBABILITY, 2023, 55 (03) : 1075 - 1083