Reinforcement learning with predefined and inferred reward machines in stochastic games

Cited: 0
Authors
Hu, Jueming [1 ]
Paliwal, Yash [1 ]
Kim, Hyohun [1 ]
Wang, Yanze [1 ]
Xu, Zhe [1 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85281 USA
Keywords
Reinforcement learning; Non-Markovian rewards; Reward machine; Non-cooperative stochastic game;
DOI
10.1016/j.neucom.2024.128170
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper focuses on Multi-Agent Reinforcement Learning (MARL) in non-cooperative stochastic games, particularly addressing the challenge of task completion characterized by non-Markovian reward functions. We employ Reward Machines (RMs) to incorporate high-level task knowledge. First, we introduce Q-learning with Reward Machines for Stochastic Games (QRM-SG), where RMs are predefined and available to the agents. QRM-SG learns each agent's best-response policy at a Nash equilibrium by defining the Q-function on an augmented state space that integrates the stochastic game state and the RM state. At each time step, the Lemke-Howson method is used to compute the best-response policies for the stage game defined by the current Q-functions. We then explore the more challenging scenario where RMs are unavailable and propose Multi-Agent Reinforcement learning with Concurrent High-level knowledge inference (MARCH). MARCH uses automata learning to infer RMs iteratively and combines this process with QRM-SG to learn the best-response policies; RL episodes whose obtained rewards are inconsistent with the rewards predicted by the current RMs trigger the inference of new RMs. We prove that QRM-SG and MARCH converge to best-response policies under certain conditions. Experiments in two scenarios demonstrate the superior performance of QRM-SG and MARCH compared to baseline methods.
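To make the abstract's mechanics concrete, the following is a minimal, hypothetical Python sketch (not the authors' code) of the two ideas it describes: a Q-function defined on the augmented state (game state s, RM state u), and MARCH's trigger that flags an episode whose observed rewards disagree with the current RM. All names (SimpleRewardMachine, q_update, trace_inconsistent) and the toy dynamics are illustrative assumptions; in the paper, the stage game over all agents' Q-functions is solved with the Lemke-Howson method, which is replaced here by random exploration for brevity.

```python
import random
from collections import defaultdict

class SimpleRewardMachine:
    """A reward machine: finite states, transitions driven by high-level events."""
    def __init__(self, transitions, rewards, initial=0):
        self.delta = transitions   # maps (rm_state, event) -> next rm_state
        self.rho = rewards         # maps (rm_state, event) -> reward
        self.initial = initial

    def step(self, u, event):
        u_next = self.delta.get((u, event), u)   # stay put on unknown events
        return u_next, self.rho.get((u, event), 0.0)

def q_update(Q, s, u, a, r, s_next, u_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update on the augmented state (s, u)."""
    best_next = max(Q[(s_next, u_next, b)] for b in actions)
    Q[(s, u, a)] += alpha * (r + gamma * best_next - Q[(s, u, a)])

def trace_inconsistent(rm, trace):
    """MARCH-style trigger: True if observed rewards contradict the current RM.
    trace is a list of (event, observed_reward) pairs from one episode."""
    u = rm.initial
    for event, observed in trace:
        u, predicted = rm.step(u, event)
        if predicted != observed:
            return True
    return False

# Toy run: a two-state RM that pays 1.0 the first time the 'goal' event occurs,
# making the reward non-Markovian in the environment state alone.
rm = SimpleRewardMachine(transitions={(0, 'goal'): 1}, rewards={(0, 'goal'): 1.0})
Q = defaultdict(float)
actions = [0, 1]
s, u = 0, rm.initial
for _ in range(200):
    a = random.choice(actions)   # pure exploration stands in for Lemke-Howson
    s_next = (s + a) % 5         # stand-in single-agent dynamics
    event = 'goal' if s_next == 4 else None
    u_next, r = rm.step(u, event)
    q_update(Q, s, u, a, r, s_next, u_next, actions)
    s, u = s_next, u_next
```

In MARCH's terms, an episode for which trace_inconsistent returns True would serve as a counterexample that triggers automata learning to infer a new RM.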
Pages: 19
Related References
47 records in total
[21] Lemke, C.E.; Howson, J.T. Equilibrium Points of Bimatrix Games. Journal of the Society for Industrial and Applied Mathematics, 1964, 12(2):413-423.
[22] León, B.G. arXiv:2002.06000, 2020.
[23] Levine, S. arXiv:1805.00909, 2018. DOI: 10.48550/arXiv.1805.00909.
[24] Li, X. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, p. 3834. DOI: 10.1109/IROS.2017.8206234.
[25] Lin, X.; Adams, S.C.; Beling, P.A. Multi-agent Inverse Reinforcement Learning for Certain General-Sum Stochastic Games. Journal of Artificial Intelligence Research, 2019, 66:473-502.
[26] Lin, Z.Y. arXiv:1709.03969, 2021.
[27] Lowe, R. Advances in Neural Information Processing Systems, Vol. 30, 2017.
[28] Melo, F.S. Tech. Rep., 2001, p. 1.
[29] Muniraj, D. IEEE Conference on Decision and Control (CDC), 2018, p. 4141. DOI: 10.1109/CDC.2018.8618746.
[30] Nash, J. Annals of Mathematics, 1951, 54:286. DOI: 10.2307/1969529.