Mean Field Equilibrium in Multi-Armed Bandit Game with Continuous Reward

Cited by: 0
Authors
Wang, Xiong [1 ]
Jia, Riheng [2 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Zhejiang Normal Univ, Jinhua, Peoples R China
Source
PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021 | 2021
Keywords
DYNAMIC-GAMES;
DOI
None
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Mean field games facilitate analyzing the multi-armed bandit (MAB) problem for a large number of agents by approximating their interactions with an average effect. Existing mean field models for multi-agent MAB mostly assume a binary reward function, which leads to tractable analysis but is usually not applicable in practical scenarios. In this paper, we study the mean field bandit game with a continuous reward function. Specifically, we focus on deriving the existence and uniqueness of the mean field equilibrium (MFE), thereby guaranteeing the asymptotic stability of the multi-agent system. To accommodate the continuous reward function, we encode the learned reward into an agent state, which is in turn mapped to its stochastic arm-playing policy and updated using realized observations. We show that the state evolution is upper semi-continuous, from which the existence of an MFE follows. Since Markov analysis mainly applies to discrete states, we transform the stochastic continuous state evolution into a deterministic ordinary differential equation (ODE). On this basis, we characterize a contraction mapping for the ODE to ensure a unique MFE for the bandit game. Extensive evaluations validate our MFE characterization and exhibit tight empirical regret for the MAB problem.
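The abstract's pipeline (state encodes learned reward, state maps to a stochastic arm-playing policy, state is updated from realized observations under a mean-field congestion effect) can be illustrated with a minimal sketch. This is not the authors' algorithm: the softmax policy, the Gaussian reward model, the congestion discount `1 - mean_field[arm]`, and all function names here are illustrative assumptions.

```python
import math
import random

def softmax_policy(state, tau=0.5):
    # Map an agent's reward-estimate state to a stochastic
    # arm-playing policy (illustrative choice of mapping).
    exps = [math.exp(s / tau) for s in state]
    z = sum(exps)
    return [e / z for e in exps]

def play_round(state, mean_field, step, true_means):
    # One stochastic-approximation update of the agent state.
    # Assumption for illustration: each arm's continuous reward is
    # discounted by the mean-field mass on that arm (congestion).
    probs = softmax_policy(state)
    arm = random.choices(range(len(state)), weights=probs)[0]
    reward = random.gauss(true_means[arm] * (1 - mean_field[arm]), 0.1)
    new_state = list(state)
    # Move the played arm's estimate toward the realized observation.
    new_state[arm] += step * (reward - state[arm])
    return new_state, arm
```

With a decaying step size, the averaged state trajectory tracks a deterministic ODE, which is the kind of limit the paper analyzes via a contraction-mapping argument to obtain uniqueness of the MFE.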
Pages: 3118-3124
Page count: 7