Generalized multiagent learning with performance bound

Cited by: 13
Authors
Banerjee, Bikramjit [1]
Peng, Jing [1]
Affiliations
[1] Tulane Univ, Dept Elect Engn & Comp Sci, New Orleans, LA 70118 USA
Keywords
multiagent reinforcement learning; game theory
DOI
10.1007/s10458-007-9013-x
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
We present new multiagent learning (MAL) algorithms with the general philosophy of policy convergence against some classes of opponents while otherwise ensuring high payoffs. We consider a three-class breakdown of opponent types: (eventually) stationary, self-play, and "other" (see Definition 4) agents. We start with ReDVaLeR, which can satisfy policy convergence against the first two types and no-regret against the third, but needs to know the type of each opponent. This serves as a baseline that delineates the difficulty of achieving these goals. We show that a simple modification of ReDVaLeR yields a new algorithm, RV_σ(t), that simultaneously achieves no-regret payoffs in all games and convergence to Nash equilibria in self-play (and to best response against eventually stationary opponents, a corollary of no-regret), without knowing the opponent types, but in a smaller class of games than ReDVaLeR. RV_σ(t) effectively ensures the performance of the learner during the process of learning, as opposed to the performance of a learned behavior. We show that the regret bound of RV_σ(t) can take a slightly better form than those of comparable algorithms such as GIGA and GIGA-WoLF, although, in contrast, our analysis is in continuous time. Moreover, experiments show that RV_σ(t) can converge to an equilibrium in some cases where GIGA and GIGA-WoLF fail, and to better equilibria in coordination games where GIGA and GIGA-WoLF converge to undesirable equilibria. This important class of coordination games also highlights why policy convergence, rather than high average payoff, is the key criterion for MAL in self-play. To our knowledge, this is also the first successful (guaranteed) attempt at policy convergence of a no-regret algorithm in the Shapley game.
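To make the no-regret criterion mentioned in the abstract concrete, the following is a minimal, hypothetical Python sketch of a GIGA-style projected-gradient learner together with the external-regret measure it is judged by (payoff of the best fixed action in hindsight minus the learner's accumulated payoff). This is not RV_σ(t) or the paper's construction; the 1/sqrt(t) step size, the simplex-projection routine, and the random bounded reward sequence are assumptions made purely for illustration.

import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (standard sort-based routine)."""
    u = np.sort(v)[::-1]                       # sort components in descending order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def giga_style_regret(reward_vectors):
    """Run a projected-gradient (GIGA-style) learner over a reward sequence and
    return its external regret: best fixed action's payoff minus the learner's."""
    n = reward_vectors.shape[1]
    x = np.full(n, 1.0 / n)                    # start from the uniform mixed strategy
    learner_payoff = 0.0
    cumulative_rewards = np.zeros(n)
    for t, r in enumerate(reward_vectors, start=1):
        learner_payoff += float(r @ x)         # expected payoff of the current mixed strategy
        cumulative_rewards += r
        eta = 1.0 / np.sqrt(t)                 # diminishing step size, as in GIGA-type analyses
        x = project_to_simplex(x + eta * r)    # gradient step, then project back to the simplex
    best_fixed = cumulative_rewards.max()      # best single action in hindsight
    return best_fixed - learner_payoff

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 5000
    rewards = rng.uniform(0.0, 1.0, size=(T, 3))   # illustrative bounded reward sequence
    regret = giga_style_regret(rewards)
    print(f"total regret after {T} rounds: {regret:.2f} (average {regret / T:.4f})")

Under this kind of update the regret grows sublinearly in T, so the printed average regret shrinks toward zero as T increases; the abstract's claim is that RV_σ(t) attains a comparable guarantee (in a slightly better form, analyzed in continuous time) while also achieving policy convergence in self-play, which a plain no-regret learner like the sketch above does not guarantee.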
Pages: 281-312
Number of pages: 32