Distributed cooperative decision making in multi-agent multi-armed bandits

Cited by: 45
Authors
Landgren, Peter [1 ]
Srivastava, Vaibhav [2 ]
Leonard, Naomi Ehrich [1 ]
Affiliations
[1] Princeton Univ, Dept Mech & Aerosp Engn, Princeton, NJ 08544 USA
[2] Michigan State Univ, Dept Elect & Comp Engn, E Lansing, MI 48824 USA
Keywords
Multi-armed bandits; Multi-agent systems; Distributed decision making; Explore-exploit dilemma;
DOI
10.1016/j.automatica.2020.109445
CLC number
TP [automation technology; computer technology];
Subject classification code
0812;
Abstract
We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own individual reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider an unconstrained reward model, in which two or more agents can choose the same arm and collect independent rewards, and a constrained reward model, in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of the mean reward at each arm. We leverage the estimates from this algorithm to develop two distributed algorithms, coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to that of a centralized fusion center. Further, we investigate the influence of the communication graph structure on performance: we propose a novel graph explore-exploit index that predicts the relative performance of groups in terms of the communication graph, and a novel nodal explore-exploit centrality index that predicts the relative performance of agents in terms of their locations in the graph. (C) 2020 Elsevier Ltd. All rights reserved.
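The setup described in the abstract can be illustrated with a minimal sketch. This is not the paper's coop-UCB2 algorithm itself; it assumes Gaussian arm rewards, a ring communication graph, Metropolis-style doubly stochastic consensus weights, and plain UCB indices computed on consensus-averaged statistics. All parameter values are hypothetical toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters (hypothetical, not from the paper): M agents, K arms, T rounds.
M, K, T = 4, 3, 2000
true_means = np.array([0.2, 0.5, 0.8])  # arm 2 is best

# Fixed communication graph: a ring over the M agents.
A = np.zeros((M, M))
for i in range(M):
    A[i, (i - 1) % M] = A[i, (i + 1) % M] = 1.0
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian
W = np.eye(M) - 0.25 * L                # doubly stochastic consensus weights

# Per-agent running-consensus states: estimated cumulative rewards and pull counts.
s = np.zeros((M, K))
n = np.full((M, K), 1e-9)               # tiny init avoids division by zero

for t in range(1, T + 1):
    if t <= K:                          # initialization: every agent pulls each arm once
        choices = np.full(M, t - 1)
    else:                               # UCB index on the consensus estimates
        ucb = s / n + np.sqrt(2.0 * np.log(t) / n)
        choices = ucb.argmax(axis=1)

    # Unconstrained reward model: agents on the same arm earn independent rewards.
    rewards = rng.normal(true_means[choices], 0.1)

    # Fold in local observations, then take one consensus step over the graph.
    ds = np.zeros((M, K)); ds[np.arange(M), choices] = rewards
    dn = np.zeros((M, K)); dn[np.arange(M), choices] = 1.0
    s = W @ (s + ds)
    n = W @ (n + dn)

best = (s / n).argmax(axis=1)           # each agent's estimated best arm
print(best)
```

Because the consensus matrix is doubly stochastic, each agent's reward and count estimates track the group averages, so after enough rounds every agent's empirical best arm agrees with the true best arm despite only local communication.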
Pages: 13
References
32 items in total
[1] Anandkumar A., Michael N., Tang K., Swami A. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 2011, 29(4): 731-745.
[2] Anantharam V., Varaiya P., Walrand J. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 1987, 32(11): 968-976.
[3] [Anonymous]. Random Graphs. 1998.
[4] Auer P., Cesa-Bianchi N., Fischer P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002, 47(2-3): 235-256.
[5] Bistritz I. PROC 32 INT C NEURAL, 2018, 31: 7222.
[6] Boucheron S. CONCENTRATION INEQUA. 2016.
[7] Braca P., Marano S., Matta V. Enforcing consensus while monitoring the environment in wireless sensor networks. IEEE Transactions on Signal Processing, 2008, 56(7): 3375-3380.
[8] Bubeck S., Cesa-Bianchi N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012, 5(1): 1-122.
[9] Bullo F. PRINC SER APPL MATH, 2009: 1.
[10] Cheung M. Y. IEEE INT C INT ROBOT, 2013: 3368. DOI 10.1109/IROS.2013.6696836.