Decentralized graph-based multi-agent reinforcement learning using reward machines

Cited by: 4
Authors
Hu, Jueming [1 ]
Xu, Zhe [1 ]
Wang, Weichang [2 ]
Qu, Guannan [3 ]
Pang, Yutian [1 ]
Liu, Yongming [1 ]
Affiliations
[1] Arizona State Univ, Sch Engn Matter Transport & Energy, Tempe, AZ 85287 USA
[2] Arizona State Univ, Sch Elect Comp & Energy Engn, Tempe, AZ USA
[3] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
Funding
National Science Foundation (USA);
Keywords
Decentralized; Multi-agent; Reinforcement learning; Reward machine; Efficiency; SUBGOAL AUTOMATA; ALGORITHMS; INDUCTION;
DOI
10.1016/j.neucom.2023.126974
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In multi-agent reinforcement learning (MARL), it is challenging for a collection of agents to learn complex temporally extended tasks. The difficulties lie in the computational complexity of joint learning and in capturing the high-level structure behind reward functions. We study the graph-based Markov Decision Process (MDP), in which the dynamics of neighboring agents are coupled. To learn complex temporally extended tasks, we use a reward machine (RM) to encode each agent's task and to expose the internal structure of its reward function. An RM can describe high-level knowledge and encode non-Markovian reward functions. To tackle the computational complexity, we propose a decentralized learning algorithm, decentralized graph-based reinforcement learning using reward machines (DGRM), which equips each agent with a localized policy so that agents make decisions independently from locally available information. DGRM uses an actor-critic structure, and we introduce a tabular Q-function for discrete-state problems. We show that the dependency of the Q-function on other agents decreases exponentially as the distance between them increases. To further improve efficiency, we also propose the deep DGRM algorithm, which approximates the Q-function and policy with deep neural networks for large-scale or continuous-state problems. The effectiveness of DGRM is evaluated in three case studies: two wireless communication problems with independent and dependent reward functions, respectively, and COVID-19 pandemic mitigation. Experimental results show that local information is sufficient for DGRM and that agents can accomplish complex tasks with the help of RMs. In the COVID-19 pandemic mitigation case, DGRM improves the global accumulated reward by 119% compared to the baseline.
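The abstract describes each agent's task as a reward machine: a finite-state machine whose transitions fire on high-level propositional events and emit rewards, so a non-Markovian task becomes Markovian once the RM state is tracked alongside the agent's MDP state. The sketch below is a minimal, hedged illustration of that general idea, not the authors' implementation; the class name RewardMachine, the event labels ("at_A", "at_B"), and the transition table are illustrative assumptions.

```python
# Minimal sketch (illustrative only) of a reward machine: a finite-state
# machine over high-level events that emits rewards on its transitions,
# encoding a temporally extended, non-Markovian task.

class RewardMachine:
    def __init__(self, initial_state, transitions, terminal_states):
        # transitions: dict mapping (rm_state, event) -> (next_rm_state, reward)
        self.initial_state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states
        self.state = initial_state

    def reset(self):
        self.state = self.initial_state
        return self.state

    def step(self, event):
        # Advance on a labeled event; events with no listed transition
        # self-loop with zero reward.
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        done = self.state in self.terminal_states
        return next_state, reward, done


# Hypothetical two-step task "first reach A, then reach B":
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "at_A"): ("u1", 0.0),   # subgoal A reached, no reward yet
        ("u1", "at_B"): ("u2", 1.0),   # task completed, reward 1
    },
    terminal_states={"u2"},
)
```

In a decentralized scheme of the kind the abstract outlines, each agent would presumably maintain its own such machine and condition its localized policy on the pair (local MDP state, RM state), so that the reward seen by the learner is Markovian in that augmented state.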
Pages: 11