Combinatorial Sleeping Bandits With Fairness Constraints

Cited by: 54
Authors
Li, Fengjiao [1]
Liu, Jia [2]
Ji, Bo [1]
Affiliations
[1] Temple Univ, Dept Comp & Informat Sci, Philadelphia, PA 19122 USA
[2] Iowa State Univ, Dept Comp Sci, Ames, IA 50011 USA
Source
IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING | 2020, Vol. 7, No. 3
Funding
U.S. National Science Foundation
Keywords
Multi-armed bandit (MAB); combinatorial sleeping bandits; fairness; Upper Confidence Bound (UCB); virtual queue; Lyapunov drift; regret analysis; multi-armed bandit
DOI
10.1109/TNSE.2019.2954310
CLC Number
T [Industrial Technology]
Subject Classification Code
08
Abstract
The multi-armed bandit (MAB) model has been widely adopted for studying many practical optimization problems (network resource allocation, ad placement, crowdsourcing, etc.) with unknown parameters. The goal of the player (i.e., the decision maker) here is to maximize the cumulative reward in the face of uncertainty. However, the basic MAB model neglects several important factors of the system in many real-world applications, where multiple arms (i.e., actions) can be simultaneously played and an arm could sometimes be "sleeping" (i.e., unavailable). Besides reward maximization, ensuring fairness is also a key design concern in practice. To that end, we propose a new Combinatorial Sleeping MAB model with Fairness constraints, called CSMAB-F, aiming to address the aforementioned crucial modeling issues. The objective is now to maximize the reward while satisfying the fairness requirement of a minimum selection fraction for each individual arm. To tackle this new problem, we extend an online learning algorithm, called Upper Confidence Bound (UCB), to deal with a critical tradeoff between exploitation and exploration, and employ the virtual queue technique to properly handle the fairness constraints. By carefully integrating these two techniques, we develop a new algorithm, called Learning with Fairness Guarantee (LFG), for the CSMAB-F problem. Further, we rigorously prove that not only is LFG feasibility-optimal, but it also has a time-average regret upper bounded by N/(2η) + β₁√(mNT log T) + β₂N, where N is the total number of arms, m is the maximum number of arms that can be simultaneously played, T is the time horizon, β₁ and β₂ are constants, and η is a design parameter that we can tune. Finally, we perform extensive simulations to corroborate the effectiveness of the proposed algorithm.
Interestingly, the simulation results reveal an important tradeoff between the regret and the speed of convergence to a point satisfying the fairness constraints.
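The abstract describes LFG's two ingredients: a UCB index to balance exploration and exploitation, and per-arm virtual queues to enforce the minimum-selection-fraction fairness constraints. The sketch below illustrates one plausible drift-plus-penalty combination of the two in Python; the scoring rule, the value of η, and the omission of sleeping (unavailable) arms are simplifications for illustration, not the paper's exact algorithm.

```python
import math
import random

def lfg_sketch(mu, r, m, T, eta=100.0, seed=0):
    """Illustrative sketch of a UCB + virtual-queue bandit policy.

    mu: true Bernoulli reward means of the N arms (used only to simulate rewards)
    r:  required minimum selection fraction for each arm (fairness constraint)
    m:  maximum number of arms that can be played per round
    Returns the empirical selection fractions and the final virtual queues.
    """
    rng = random.Random(seed)
    N = len(mu)
    counts = [0] * N    # number of times each arm has been played
    means = [0.0] * N   # empirical mean reward of each arm
    queues = [0.0] * N  # virtual queues tracking each arm's fairness "debt"
    plays = [0] * N
    for t in range(1, T + 1):
        # UCB index: empirical mean plus an exploration bonus
        ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i])
               if counts[i] > 0 else float("inf") for i in range(N)]
        # Score each arm by queue backlog plus an eta-scaled (capped) UCB index;
        # this weighting is an assumption, not the paper's exact rule.
        weight = [queues[i] + eta * min(ucb[i], 1.0) for i in range(N)]
        chosen = sorted(range(N), key=lambda i: -weight[i])[:m]
        for i in range(N):
            x = 1 if i in chosen else 0
            if x:
                reward = 1.0 if rng.random() < mu[i] else 0.0
                counts[i] += 1
                means[i] += (reward - means[i]) / counts[i]
                plays[i] += 1
            # Queue grows by the required fraction each round, drains when played
            queues[i] = max(queues[i] + r[i] - x, 0.0)
    return [p / T for p in plays], queues
```

With three arms of means (0.9, 0.5, 0.2), each required a 0.3 selection fraction, and m = 1, a pure UCB player would starve the low-reward arm; the virtual queues instead drive every arm's long-run selection fraction toward at least 0.3, with the remaining capacity going to the best arm.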
Pages: 1799-1813 (15 pages)