Sample-Efficient Reinforcement Learning of Partially Observable Markov Games

Cited by: 0
Authors
Liu, Qinghua [1 ]
Szepesvari, Csaba [2 ,3 ]
Jin, Chi [1 ]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
[2] DeepMind, Edmonton, AB, Canada
[3] Univ Alberta, Edmonton, AB, Canada
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022 | 2022
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
COMPLEXITY;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
This paper considers the challenging task of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent sees only her own individual observations and actions, which reveal incomplete information about the underlying state of the system. We study these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs, weakly revealing POMGs, in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) suffices to find approximate Nash equilibria, correlated equilibria, and coarse correlated equilibria of weakly revealing POMGs in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm achieves sublinear regret when compared against the optimal maximin policies. To the best of our knowledge, this work provides the first line of sample-efficient results for learning POMGs.
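The optimism-plus-MLE recipe in the abstract can be sketched in a toy single-agent form: maintain a confidence set of candidate models whose log-likelihood on the data collected so far is near-maximal, then act according to the most optimistic model in that set. Everything below (the finite coin-bias model class, the `beta` threshold, the `step`/`loglik`/`value` helper names) is an illustrative assumption for a bandit-style setting, not the paper's actual algorithm for POMGs.

```python
import math
import random

def omle(models, step, loglik, value, num_iters, beta=1.0):
    """Toy Optimistic MLE loop (illustrative sketch). Each round: keep
    the models whose total log-likelihood on collected data is within
    `beta` of the best (the confidence set), then act according to the
    model of highest optimistic value and record the new observation."""
    data, chosen = [], []
    for _ in range(num_iters):
        if data:
            scores = {m: sum(loglik(m, o) for o in data) for m in models}
            best = max(scores.values())
            conf_set = [m for m in models if scores[m] >= best - beta]
        else:
            conf_set = list(models)  # no data yet: every model is plausible
        m_opt = max(conf_set, key=value)  # optimism in the face of uncertainty
        chosen.append(m_opt)
        data.append(step(m_opt))  # the data always comes from the true environment
    return chosen

# Usage: candidate coin biases; the true environment flips a 0.7-biased coin.
rng = random.Random(0)
models = [0.3, 0.5, 0.7, 0.9]
step = lambda m: 1 if rng.random() < 0.7 else 0   # true env ignores the chosen model
loglik = lambda m, o: math.log(m if o else 1.0 - m)
value = lambda m: m  # a more biased coin optimistically promises more reward
chosen = omle(models, step, loglik, value, num_iters=300)
```

The key design point mirrored here is that exploration is driven purely by the shrinking confidence set: the over-optimistic model (bias 0.9) is played until the likelihood evidence excludes it, after which the true model dominates.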
Pages: 13