Policy Evaluation and Seeking for Multiagent Reinforcement Learning via Best Response

Cited by: 8
Authors
Yan, Rui [1 ]
Duan, Xiaoming [2 ,3 ]
Shi, Zongying [1 ]
Zhong, Yisheng [1 ]
Marden, Jason R. [4 ]
Bullo, Francesco [3 ]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China
[2] Univ Calif Santa Barbara, Dept Mech Engn, Santa Barbara, CA 93106 USA
[3] Univ Calif Santa Barbara, Ctr Control Dynam Syst & Computat, Santa Barbara, CA 93106 USA
[4] Univ Calif Santa Barbara, Dept Elect & Comp Engn, Santa Barbara, CA 93106 USA
Funding
National Natural Science Foundation of China;
Keywords
Best response; multiagent reinforcement learning; policy evaluation and seeking; sink equilibrium; stochastic stability; GAME; EQUILIBRIA; DYNAMICS; GO;
DOI
10.1109/TAC.2021.3085171
CLC number
TP [automation technology; computer technology];
Subject classification code
0812;
Abstract
Multiagent policy evaluation and seeking are long-standing challenges in developing theories for multiagent reinforcement learning (MARL), due to multidimensional learning goals, nonstationary environments, and scalability issues in the joint policy space. This article introduces two metrics, grounded in a game-theoretic solution concept called the sink equilibrium, for the evaluation, ranking, and computation of policies in multiagent learning. We adopt strict best response dynamics (SBRDs) to model selfish behaviors at a meta-level for MARL. Our approach can deal with dynamic cyclic behaviors (unlike approaches based on Nash equilibria and Elo ratings) and is more compatible with single-agent reinforcement learning than α-rank, which relies on weakly better responses. We first consider settings where the difference between the largest and second-largest equilibrium metric has a known lower bound. With this knowledge, we propose a class of perturbed SBRDs with the following property: for a broad class of stochastic games with finite memory, only policies with the maximum metric are observed with nonzero probability. We then consider settings where this lower bound is unknown, and propose a class of perturbed SBRDs such that the metrics of the policies observed with nonzero probability differ from the optimal by at most a given tolerance. The proposed perturbed SBRDs address the scalability issue and opponent-induced nonstationarity by fixing the strategies of the other agents while one agent learns, and use empirical game-theoretic analysis to estimate payoffs for each strategy profile generated by the perturbation.
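To make the mechanism in the abstract concrete, below is a minimal sketch of perturbed strict best response dynamics on a toy two-player normal-form game. The payoff matrices, function names, and the uniform-mutation perturbation are hypothetical illustrations only; the paper's actual dynamics operate at a meta-level over learned policies in stochastic games and estimate payoffs via empirical game-theoretic analysis.

```python
# Hypothetical sketch: perturbed strict best response dynamics (SBRD) on a
# toy two-player normal-form game. Payoffs and the mutation scheme are
# illustrative assumptions, not the authors' algorithm.
import numpy as np

rng = np.random.default_rng(seed=0)

# payoffs[i][a, b] is player i's payoff when player 0 plays row a and
# player 1 plays column b (a Prisoner's-Dilemma-like example).
payoffs = [
    np.array([[3.0, 0.0],
              [5.0, 1.0]]),  # player 0
    np.array([[3.0, 5.0],
              [0.0, 1.0]]),  # player 1
]


def strict_best_response(player: int, profile: list[int]) -> int:
    """Best action for `player` against the others' fixed actions; the
    player deviates only if the deviation is *strictly* improving."""
    if player == 0:
        utilities = payoffs[0][:, profile[1]]
    else:
        utilities = payoffs[1][profile[0], :]
    candidate = int(np.argmax(utilities))
    current = profile[player]
    return candidate if utilities[candidate] > utilities[current] else current


def perturbed_sbrd(steps: int = 20_000, epsilon: float = 0.05) -> dict:
    """Run the perturbed dynamics and count joint-profile visit frequencies.

    With probability 1 - epsilon the activated player plays a strict best
    response; with probability epsilon it mutates to a uniformly random
    action. As epsilon shrinks, visits concentrate on stochastically stable
    profiles, analogous to the sink-equilibrium selection in the abstract.
    """
    profile = [0, 0]
    visits: dict = {}
    for _ in range(steps):
        player = int(rng.integers(2))              # activate one player
        if rng.random() < epsilon:                 # perturbation step
            profile[player] = int(rng.integers(payoffs[player].shape[player]))
        else:                                      # strict best response step
            profile[player] = strict_best_response(player, profile)
        visits[tuple(profile)] = visits.get(tuple(profile), 0) + 1
    return visits


if __name__ == "__main__":
    for joint_profile, count in sorted(perturbed_sbrd().items()):
        print(joint_profile, count)
```

In this sketch the mutation rate epsilon plays the role of the perturbation magnitude, which, per the two settings in the abstract, must be chosen relative to the (known or unknown) gap between the largest and second-largest equilibrium metric.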
Pages: 1898-1913
Page count: 16