Reinforcement learning with predefined and inferred reward machines in stochastic games

Cited by: 0
Authors
Hu, Jueming [1 ]
Paliwal, Yash [1 ]
Kim, Hyohun [1 ]
Wang, Yanze [1 ]
Xu, Zhe [1 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85281 USA
Keywords
Reinforcement learning; Non-Markovian rewards; Reward machine; Non-cooperative stochastic game
DOI
10.1016/j.neucom.2024.128170
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
This paper focuses on Multi-Agent Reinforcement Learning (MARL) in non-cooperative stochastic games, particularly addressing the challenge of completing tasks characterized by non-Markovian reward functions. We employ Reward Machines (RMs) to incorporate high-level task knowledge. First, we introduce Q-learning with Reward Machines for Stochastic Games (QRM-SG), where RMs are predefined and available to the agents. QRM-SG learns each agent's best-response policy at a Nash equilibrium by defining the Q-function over an augmented state space that combines the stochastic-game state and the RM states. At each time step, the Lemke-Howson method is used to compute the best-response policies for the stage game defined by the current Q-functions. We then explore a more challenging scenario in which RMs are unavailable and propose Multi-Agent Reinforcement learning with Concurrent High-level knowledge inference (MARCH). MARCH uses automata learning to infer RMs iteratively and combines this inference with QRM-SG to learn the best-response policies. RL episodes in which the obtained rewards are inconsistent with the rewards predicted by the current RMs trigger the inference of new RMs. We prove that QRM-SG and MARCH converge to the best-response policies under certain conditions. Experiments in two scenarios demonstrate the superior performance of QRM-SG and MARCH over baseline methods.
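To make the core mechanism concrete, the following is a minimal Python sketch of the augmented-state Q-update that underlies QRM-SG, simplified to a single learner with epsilon-greedy exploration; the actual method maintains one Q-function per agent and solves the stage game at every step via the Lemke-Howson method. The toy environment, the labeling function label, the reward-machine tables RM_DELTA and RM_REWARD, and the trace_inconsistent trigger are illustrative assumptions, not the paper's implementation.

import random
from collections import defaultdict

# Toy environment (assumed for illustration): states 0..3 on a line;
# action 0 moves left, action 1 moves right.
def env_step(s, a):
    return max(0, min(3, s + (1 if a == 1 else -1)))

def label(s):
    # Labeling function: high-level events emitted at low-level states.
    return {0: "a", 3: "b"}.get(s, "")

# Hypothetical reward machine for the non-Markovian task "reach a, then b":
# a transition function and transition rewards as lookup tables.
RM_DELTA = {(0, "a"): 1, (1, "b"): 2}   # (RM state, event) -> next RM state
RM_REWARD = {(1, "b"): 1.0}             # reward paid on completing the task
TERMINAL_U = 2

def rm_step(u, event):
    return RM_DELTA.get((u, event), u), RM_REWARD.get((u, event), 0.0)

def qrm_sketch(episodes=500, alpha=0.1, gamma=0.95, eps=0.1, horizon=20):
    # Q is defined over the augmented state (s, u): the environment state s
    # paired with the RM state u, which restores the Markov property of
    # the otherwise non-Markovian reward.
    Q = defaultdict(float)
    for _ in range(episodes):
        s, u = 1, 0
        for _ in range(horizon):
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[((s, u), act)])
            s2 = env_step(s, a)
            u2, r = rm_step(u, label(s2))
            if u2 == TERMINAL_U:
                target = r
            else:
                target = r + gamma * max(Q[((s2, u2), act)] for act in (0, 1))
            Q[((s, u), a)] += alpha * (target - Q[((s, u), a)])
            s, u = s2, u2
            if u == TERMINAL_U:
                break
    return Q

def trace_inconsistent(trace):
    # MARCH-style trigger (sketch): an episode trace of (event, reward)
    # pairs is inconsistent if the current RM predicts different rewards;
    # such a trace would trigger inference of a new RM via automata learning.
    u = 0
    for event, observed_r in trace:
        u, predicted_r = rm_step(u, event)
        if predicted_r != observed_r:
            return True
    return False

if __name__ == "__main__":
    Q = qrm_sketch()
    # Moving left first (toward event "a") should score higher from the start.
    print(Q[((1, 0), 0)], Q[((1, 0), 1)])

The augmented state (s, u) is what makes plain Q-learning sound here: conditioning on the RM state turns the history-dependent reward into a Markovian one, so standard temporal-difference updates apply.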
Pages: 19