Reinforcement learning with predefined and inferred reward machines in stochastic games

Cited: 0
Authors
Hu, Jueming [1 ]
Paliwal, Yash [1 ]
Kim, Hyohun [1 ]
Wang, Yanze [1 ]
Xu, Zhe [1 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85281 USA
Keywords
Reinforcement learning; Non-Markovian rewards; Reward machine; Non-cooperative stochastic game;
DOI
10.1016/j.neucom.2024.128170
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper focuses on Multi-Agent Reinforcement Learning (MARL) in non-cooperative stochastic games, particularly addressing the challenge of task completion characterized by non-Markovian reward functions. We employ Reward Machines (RMs) to incorporate high-level task knowledge. First, we introduce Q-learning with Reward Machines for Stochastic Games (QRM-SG), where RMs are predefined and available to the agents. QRM-SG learns each agent's best-response policy at a Nash equilibrium by defining the Q-function on an augmented state space that integrates the stochastic game state and the RM state. At each time step, the Lemke-Howson method is used to compute the best-response policies for the stage game defined by the current Q-functions. We then explore the more challenging scenario where RMs are unavailable and propose Multi-Agent Reinforcement learning with Concurrent High-level knowledge inference (MARCH). MARCH uses automata learning to infer RMs iteratively and combines this process with QRM-SG to learn the best-response policies; RL episodes whose obtained rewards are inconsistent with the rewards predicted by the current RMs trigger the inference of new RMs. We prove that QRM-SG and MARCH converge to best-response policies under certain conditions. Experiments in two scenarios demonstrate the superior performance of QRM-SG and MARCH compared to baseline methods.
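To make the abstract's mechanics concrete, the following is a minimal, hypothetical Python sketch (not the authors' code) of the two ideas it describes: a Q-function defined on the augmented state (game state s, RM state u), and MARCH's trigger that flags an episode whose observed rewards disagree with the current RM. All names (SimpleRewardMachine, q_update, trace_inconsistent) and the toy dynamics are illustrative assumptions; in the paper, the stage game over all agents' Q-functions is solved with the Lemke-Howson method, which is replaced here by random exploration for brevity.

```python
import random
from collections import defaultdict

class SimpleRewardMachine:
    """A reward machine: finite states, transitions driven by high-level events."""
    def __init__(self, transitions, rewards, initial=0):
        self.delta = transitions   # maps (rm_state, event) -> next rm_state
        self.rho = rewards         # maps (rm_state, event) -> reward
        self.initial = initial

    def step(self, u, event):
        u_next = self.delta.get((u, event), u)   # stay put on unknown events
        return u_next, self.rho.get((u, event), 0.0)

def q_update(Q, s, u, a, r, s_next, u_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update on the augmented state (s, u)."""
    best_next = max(Q[(s_next, u_next, b)] for b in actions)
    Q[(s, u, a)] += alpha * (r + gamma * best_next - Q[(s, u, a)])

def trace_inconsistent(rm, trace):
    """MARCH-style trigger: True if observed rewards contradict the current RM.
    trace is a list of (event, observed_reward) pairs from one episode."""
    u = rm.initial
    for event, observed in trace:
        u, predicted = rm.step(u, event)
        if predicted != observed:
            return True
    return False

# Toy run: a two-state RM that pays 1.0 the first time the 'goal' event occurs,
# making the reward non-Markovian in the environment state alone.
rm = SimpleRewardMachine(transitions={(0, 'goal'): 1}, rewards={(0, 'goal'): 1.0})
Q = defaultdict(float)
actions = [0, 1]
s, u = 0, rm.initial
for _ in range(200):
    a = random.choice(actions)   # pure exploration stands in for Lemke-Howson
    s_next = (s + a) % 5         # stand-in single-agent dynamics
    event = 'goal' if s_next == 4 else None
    u_next, r = rm.step(u, event)
    q_update(Q, s, u, a, r, s_next, u_next, actions)
    s, u = s_next, u_next
```

In MARCH's terms, an episode for which trace_inconsistent returns True would serve as a counterexample that triggers automata learning to infer a new RM.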
Pages: 19
Related References
47 records in total
[21] Lemke, C.E.; Howson, J.T. Equilibrium Points of Bimatrix Games. Journal of the Society for Industrial and Applied Mathematics, 1964, 12(2):413-423.
[22] León, B.G. arXiv:2002.06000, 2020.
[23] Levine, S. arXiv:1805.00909, 2018. DOI: 10.48550/arXiv.1805.00909.
[24] Li, X. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, p. 3834. DOI: 10.1109/IROS.2017.8206234.
[25] Lin, X.; Adams, S.C.; Beling, P.A. Multi-agent Inverse Reinforcement Learning for Certain General-Sum Stochastic Games. Journal of Artificial Intelligence Research, 2019, 66:473-502.
[26] Lin, Z.Y. arXiv:1709.03969, 2021.
[27] Lowe, R. Advances in Neural Information Processing Systems, Vol. 30, 2017.
[28] Melo, F.S. Tech. Rep., 2001, p. 1.
[29] Muniraj, D. IEEE Conference on Decision and Control (CDC), 2018, p. 4141. DOI: 10.1109/CDC.2018.8618746.
[30] Nash, J. Annals of Mathematics, 1951, 54:286. DOI: 10.2307/1969529.