Multi-agent actor-critic with time dynamical opponent model

Cited by: 4
Authors
Tian, Yuan [1 ]
Kladny, Klaus -Rudolf [2 ]
Wang, Qin [3 ]
Huang, Zhiwu [5 ]
Fink, Olga [4 ]
Affiliations
[1] Swiss Fed Inst Technol, Intelligent Maintenance Syst, Zurich, Switzerland
[2] Swiss Fed Inst Technol, Data Sci, Zurich, Switzerland
[3] Swiss Fed Inst Technol, Zurich, Switzerland
[4] Ecole Polytech Fed Lausanne, Intelligent Maintenance & Operat Syst, Lausanne, Switzerland
[5] Singapore Management Univ, Comp Sci, Singapore, Singapore
Funding
Swiss National Science Foundation;
Keywords
Reinforcement learning; Multi-agent reinforcement learning; Multi-agent systems; Opponent modeling; Non-stationarity; LEVEL;
DOI
10.1016/j.neucom.2022.10.045
CLC Number
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In multi-agent reinforcement learning, multiple agents learn simultaneously while interacting with a common environment and with each other. Since the agents adapt their policies during learning, not only does the behavior of a single agent become non-stationary, but so does the environment as perceived by each agent. This makes policy improvement particularly challenging. In this paper, we propose to exploit the fact that the agents seek to improve their expected cumulative reward, and introduce a novel Time Dynamical Opponent Model (TDOM) to encode the knowledge that opponent policies tend to improve over time. We motivate TDOM theoretically by deriving a lower bound on the log objective of an individual agent, and further propose Multi-Agent Actor-Critic with Time Dynamical Opponent Model (TDOM-AC). We evaluate the proposed TDOM-AC on a differential game and the Multi-Agent Particle Environment. We show empirically that TDOM achieves superior opponent behavior prediction at test time. The proposed TDOM-AC methodology outperforms state-of-the-art actor-critic methods on the performed tasks in cooperative and especially in mixed cooperative-competitive environments, and yields more stable training and faster convergence. Our code is available at https://github.com/Yuantian013/TDOM-AC. (c) 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
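The abstract's central idea is that an opponent model should account for opponents whose policies change (and tend to improve) as they learn, rather than treating them as fixed. The snippet below is not the paper's TDOM, which is derived from a variational lower bound on an agent's log objective; it is only a hedged, hypothetical stand-in that illustrates the general notion of a time-dynamical opponent model: a categorical model of opponent actions with exponential recency weighting, so that predictions track a non-stationary opponent instead of averaging over its entire history.

```python
class TimeWeightedOpponentModel:
    """Categorical opponent-action model with exponential recency weighting.

    Old evidence is decayed on every update, so recent opponent behavior
    dominates the prediction -- a crude illustration of modeling an
    opponent whose policy drifts over time (this class and its parameters
    are illustrative assumptions, not part of the TDOM-AC algorithm).
    """

    def __init__(self, n_actions: int, decay: float = 0.9):
        self.decay = decay
        # Start from a uniform pseudo-count prior over opponent actions.
        self.counts = [1.0] * n_actions

    def update(self, observed_action: int) -> None:
        # Discount all past evidence, then credit the observed action.
        self.counts = [c * self.decay for c in self.counts]
        self.counts[observed_action] += 1.0

    def predict(self) -> list[float]:
        # Normalize pseudo-counts into a probability distribution.
        total = sum(self.counts)
        return [c / total for c in self.counts]


# Usage: the opponent plays action 0 for a while, then switches to action 2.
model = TimeWeightedOpponentModel(n_actions=3, decay=0.8)
for a in [0] * 20 + [2] * 20:
    model.update(a)
probs = model.predict()
```

Because the decay factor shrinks the 20 early observations of action 0 to near zero, `probs` concentrates on action 2, whereas a plain frequency count would still assign both actions equal mass. A full actor-critic agent would condition its critic and policy on such predictions.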
Pages: 165-172
Page count: 8