Generalized multiagent learning with performance bound

Cited by: 13
Authors
Banerjee, Bikramjit [1]
Peng, Jing [1]
Affiliations
[1] Tulane Univ, Dept Elect Engn & Comp Sci, New Orleans, LA 70118 USA
Keywords
multiagent reinforcement learning; game theory
DOI
10.1007/s10458-007-9013-x
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
We present new multiagent learning (MAL) algorithms with the general philosophy of policy convergence against some classes of opponents while otherwise ensuring high payoffs. We consider a three-class breakdown of opponent types: (eventually) stationary, self-play, and "other" (see Definition 4) agents. We start with ReDVaLeR, which can satisfy policy convergence against the first two types and no-regret against the third, but needs to know the type of each opponent. This serves as a baseline that delineates the difficulty of achieving these goals. We show that a simple modification of ReDVaLeR yields a new algorithm, RV_σ(t), that simultaneously achieves no-regret payoffs in all games and convergence to Nash equilibria in self-play (and to best response against eventually stationary opponents, a corollary of no-regret), without knowing the opponent types, but in a smaller class of games than ReDVaLeR. RV_σ(t) effectively ensures the performance of the learner during the process of learning, as opposed to the performance of a learned behavior. We show that the regret bound of RV_σ(t) can take a slightly better form than those of comparable algorithms such as GIGA and GIGA-WoLF, although, in contrast, our analysis is in continuous time. Moreover, experiments show that RV_σ(t) can converge to an equilibrium in some cases where GIGA and GIGA-WoLF fail, and to better equilibria in coordination games where GIGA and GIGA-WoLF converge to undesirable equilibria. This important class of coordination games also highlights why policy convergence, rather than high average payoff, is the key criterion for MAL in self-play. To our knowledge, this is also the first successful (guaranteed) attempt at policy convergence of a no-regret algorithm in the Shapley game.
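To make the no-regret criterion mentioned in the abstract concrete, the following is a minimal, hypothetical Python sketch of a GIGA-style projected-gradient learner together with the external-regret measure it is judged by (payoff of the best fixed action in hindsight minus the learner's accumulated payoff). This is not RV_σ(t) or the paper's construction; the 1/sqrt(t) step size, the simplex-projection routine, and the random bounded reward sequence are assumptions made purely for illustration.

import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (standard sort-based routine)."""
    u = np.sort(v)[::-1]                       # sort components in descending order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def giga_style_regret(reward_vectors):
    """Run a projected-gradient (GIGA-style) learner over a reward sequence and
    return its external regret: best fixed action's payoff minus the learner's."""
    n = reward_vectors.shape[1]
    x = np.full(n, 1.0 / n)                    # start from the uniform mixed strategy
    learner_payoff = 0.0
    cumulative_rewards = np.zeros(n)
    for t, r in enumerate(reward_vectors, start=1):
        learner_payoff += float(r @ x)         # expected payoff of the current mixed strategy
        cumulative_rewards += r
        eta = 1.0 / np.sqrt(t)                 # diminishing step size, as in GIGA-type analyses
        x = project_to_simplex(x + eta * r)    # gradient step, then project back to the simplex
    best_fixed = cumulative_rewards.max()      # best single action in hindsight
    return best_fixed - learner_payoff

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 5000
    rewards = rng.uniform(0.0, 1.0, size=(T, 3))   # illustrative bounded reward sequence
    regret = giga_style_regret(rewards)
    print(f"total regret after {T} rounds: {regret:.2f} (average {regret / T:.4f})")

Under this kind of update the regret grows sublinearly in T, so the printed average regret shrinks toward zero as T increases; the abstract's claim is that RV_σ(t) attains a comparable guarantee (in a slightly better form, analyzed in continuous time) while also achieving policy convergence in self-play, which a plain no-regret learner like the sketch above does not guarantee.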
Pages: 281-312
Number of pages: 32