The evolutionary dynamics of soft-max policy gradient in multi-agent settings

Cited by: 0

Authors:

Bernasconi, Martino ^{[1]}

Cacciamani, Federico ^{[1]}

Fioravanti, Simone ^{[2]}

Gatti, Nicola ^{[1]}

Trovo, Francesco ^{[1]}

Affiliations:

[1] Politecnico di Milano, Milan, Italy

[2] Gran Sasso Science Institute, L'Aquila, Italy

Source:

THEORETICAL COMPUTER SCIENCE | 2025 / Vol. 1027

Keywords:

Game theory; Evolutionary game theory; Reinforcement learning; Multiagent learning; REINFORCEMENT;

DOI:

10.1016/j.tcs.2024.115011

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Policy gradient is one of the best-known algorithms in reinforcement learning. This paper studies the mean dynamics of the soft-max policy gradient algorithm and its properties in multi-agent settings, using tools from evolutionary game theory and dynamical systems. Unlike most multi-agent reinforcement learning algorithms, whose mean dynamics are slight variants of the replicator dynamics that do not affect the properties of the original dynamics, the soft-max policy gradient dynamics have a structure significantly different from that of the replicator dynamics. In particular, we show that the soft-max policy gradient dynamics in a given game are equivalent to the replicator dynamics in an auxiliary game obtained by a non-convex transformation of the payoffs of the original game. This structure gives the dynamics several non-standard properties. The first property we study concerns convergence to the best response. While the continuous-time mean dynamics always converge to the best response, the crucial question is the convergence speed. Precisely, we show that the space of initializations can be split into two complementary sets. Trajectories initialized in the first set (called the good initialization region) move directly to the best response, whereas those initialized in the second set (called the bad initialization region) move first to a series of sub-optimal strategies and only then to the best response. Interestingly, in multi-agent adversarial machine learning environments, we show that an adversary can exploit this property to make any current strategy of a learning agent that uses the soft-max policy gradient fall inside a bad initialization region, thus slowing its learning process and exploiting that policy. When the soft-max policy gradient dynamics are studied in multi-population games, which model the learning dynamics in self-play, we show that the dynamics preserve the volume of the set of initial points. This property proves that the dynamics cannot converge when the only equilibrium of the game is fully mixed, as the volume of the set of initial points would need to shrink. We also give empirical evidence that the volume expands over time, suggesting that the dynamics in games with a fully mixed equilibrium are chaotic.
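
The initialization effect described in the abstract can be illustrated with a minimal sketch, assuming the simplest possible setting: a single agent choosing among three actions with fixed rewards. This is not the paper's code or its multi-agent construction. The reward values, starting policies, step size, and every function name below are hypothetical choices made for illustration, and whether the two starting points lie in the paper's formally defined good and bad regions is not verified here. The sketch integrates the exact gradient of the expected reward with respect to the soft-max logits, and `policy_velocity` writes the induced motion of the policy in replicator form, with a policy-dependent payoff obtained by the chain rule for this bandit case.

```python
import math

def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    z = [math.exp(t - m) for t in theta]
    s = sum(z)
    return [v / s for v in z]

def logit_gradient(pi, rewards):
    # Exact gradient of the expected reward sum_a pi_a * r_a w.r.t. the logits:
    # g_a = pi_a * (r_a - mean reward).
    mean = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - mean) for p, r in zip(pi, rewards)]

def policy_velocity(pi, rewards):
    # Replicator form of the same flow (chain rule through the soft-max):
    # d pi_a / dt = pi_a * (u_a - sum_b pi_b * u_b), with u_a = g_a above.
    u = logit_gradient(pi, rewards)
    avg = sum(p * v for p, v in zip(pi, u))
    return [p * (v - avg) for p, v in zip(pi, u)]

def simulate(rewards, pi0, step=0.1, n_steps=20_000):
    # Euler integration of the logit flow; returns the list of visited policies.
    theta = [math.log(p) for p in pi0]
    traj = [softmax(theta)]
    for _ in range(n_steps):
        grad = logit_gradient(traj[-1], rewards)
        theta = [t + step * g for t, g in zip(theta, grad)]
        traj.append(softmax(theta))
    return traj

def hitting_time(traj, action=0, level=0.9):
    # First step index at which `action` has probability >= level, else None.
    for t, pi in enumerate(traj):
        if pi[action] >= level:
            return t
    return None

# Hypothetical example: action 0 is the best response, action 1 is sub-optimal.
rewards = [1.0, 0.8, 0.0]
good = simulate(rewards, [0.40, 0.30, 0.30])  # best action already most likely
bad = simulate(rewards, [0.05, 0.90, 0.05])   # mass concentrated on action 1

print("steps to reach P(best) >= 0.9, first start: ", hitting_time(good))
print("steps to reach P(best) >= 0.9, second start:", hitting_time(bad))
print("lowest P(best) along the second run:", min(pi[0] for pi in bad))
```

From the first starting point the probability of the best action never decreases. From the second, it initially falls while the sub-optimal action gains mass, and the trajectory takes longer to reach the best response, which is the qualitative pattern the abstract attributes to unfavourable initializations.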

Pages: 23
