Distributed cooperative decision making in multi-agent multi-armed bandits

Cited by: 45
Authors
Landgren, Peter [1 ]
Srivastava, Vaibhav [2 ]
Leonard, Naomi Ehrich [1 ]
Affiliations
[1] Princeton Univ, Dept Mech & Aerosp Engn, Princeton, NJ 08544 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Keywords
Multi-armed bandits; Multi-agent systems; Distributed decision making; Explore-exploit dilemma;
DOI
10.1016/j.automatica.2020.109445
CLC number
TP [automation technology; computer technology];
Subject classification code
0812;
Abstract
We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own individual reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider an unconstrained reward model, in which two or more agents can choose the same arm and collect independent rewards, and a constrained reward model, in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of the mean reward at each arm. We leverage the estimates from this algorithm to develop two distributed algorithms, coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to that of a centralized fusion center. Further, we investigate the influence of the communication graph structure on performance: we propose a novel graph explore-exploit index that predicts the relative performance of groups in terms of the communication graph, and a novel nodal explore-exploit centrality index that predicts the relative performance of agents in terms of their locations in the graph. (C) 2020 Elsevier Ltd. All rights reserved.
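The setup described in the abstract can be illustrated with a minimal sketch. This is not the paper's coop-UCB2 algorithm itself; it assumes Gaussian arm rewards, a ring communication graph, Metropolis-style doubly stochastic consensus weights, and plain UCB indices computed on consensus-averaged statistics. All parameter values are hypothetical toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters (hypothetical, not from the paper): M agents, K arms, T rounds.
M, K, T = 4, 3, 2000
true_means = np.array([0.2, 0.5, 0.8])  # arm 2 is best

# Fixed communication graph: a ring over the M agents.
A = np.zeros((M, M))
for i in range(M):
    A[i, (i - 1) % M] = A[i, (i + 1) % M] = 1.0
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian
W = np.eye(M) - 0.25 * L                # doubly stochastic consensus weights

# Per-agent running-consensus states: estimated cumulative rewards and pull counts.
s = np.zeros((M, K))
n = np.full((M, K), 1e-9)               # tiny init avoids division by zero

for t in range(1, T + 1):
    if t <= K:                          # initialization: every agent pulls each arm once
        choices = np.full(M, t - 1)
    else:                               # UCB index on the consensus estimates
        ucb = s / n + np.sqrt(2.0 * np.log(t) / n)
        choices = ucb.argmax(axis=1)

    # Unconstrained reward model: agents on the same arm earn independent rewards.
    rewards = rng.normal(true_means[choices], 0.1)

    # Fold in local observations, then take one consensus step over the graph.
    ds = np.zeros((M, K)); ds[np.arange(M), choices] = rewards
    dn = np.zeros((M, K)); dn[np.arange(M), choices] = 1.0
    s = W @ (s + ds)
    n = W @ (n + dn)

best = (s / n).argmax(axis=1)           # each agent's estimated best arm
print(best)
```

Because the consensus matrix is doubly stochastic, each agent's reward and count estimates track the group averages, so after enough rounds every agent's empirical best arm agrees with the true best arm despite only local communication.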
Pages: 13
References
32 items in total
[1] Anandkumar A., Michael N., Tang K., Swami A. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 2011, 29(4): 731-745.
[2] Anantharam V., Varaiya P., Walrand J. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 1987, 32(11): 968-976.
[3] [Anonymous]. Random Graphs. 1998.
[4] Auer P., Cesa-Bianchi N., Fischer P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002, 47(2-3): 235-256.
[5] Bistritz I. PROC 32 INT C NEURAL, 2018, 31: 7222.
[6] Boucheron S. CONCENTRATION INEQUA. 2016.
[7] Braca P., Marano S., Matta V. Enforcing consensus while monitoring the environment in wireless sensor networks. IEEE Transactions on Signal Processing, 2008, 56(7): 3375-3380.
[8] Bubeck S., Cesa-Bianchi N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012, 5(1): 1-122.
[9] Bullo F. PRINC SER APPL MATH, 2009: 1.
[10] Cheung M. Y. IEEE INT C INT ROBOT, 2013: 3368. DOI 10.1109/IROS.2013.6696836.