Windows deep transformer Q-networks: an extended variance reduction architecture for partially observable reinforcement learning

Cited: 0
Authors
Wang, Zijian [1,2]
Wang, Bin [1,2]
Dou, Hongbo [1,2]
Liu, Zhongyuan [1,2]
Affiliations
[1] China Univ Petr East China, Qingdao Inst Software, Changjiang West Rd, Qingdao 266580, Shandong, Peoples R China
[2] China Univ Petr East China, Coll Comp Sci & Technol, Changjiang West Rd, Qingdao 266580, Shandong, Peoples R China
Keywords
Deep reinforcement learning; Deep Q-network; Transformer; Variance reduction
DOI
10.1007/s10489-024-05867-3
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Partially Observable Markov Decision Processes (POMDPs) remain worth studying in reinforcement learning (RL) because of their universality in the real world. Unlike in Markov Decision Processes (MDPs), agents in a POMDP cannot fully observe the environment state, which is an obstacle to traditional RL algorithms. One solution is to establish a sequence-to-sequence model. As the core of deep Q-networks, the Transformer has achieved strong results on partial observability problems. Nevertheless, deep Q-networks suffer from overestimation of the Q-value, which destabilizes the quality of the data fed to the Transformer. As deviations accumulate rapidly, model performance may decline drastically, producing severe errors that are fatal to policy learning. In this paper, we note that previous Q-value overestimation mitigation models are not suitable for the Deep Transformer Q-Network (DTQN) framework, because DTQN is a sequence-to-sequence model, not merely a value optimization model in traditional RL. Therefore, we propose Windows DTQN, which reduces Q-value variance via the synergistic effect of shallow and deep windows. In particular, Windows DTQN ensembles the historical Q-networks through the shallow windows, and estimates the uncertainty of the Q-networks through the deep windows for weight allocation. Our experiments on gridverse environments demonstrate that our model achieves better results than current mainstream DQN algorithms in POMDPs. Compared to DTQN, Windows DTQN increases the average success rate by 5.1% and the average return by 1.11.
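The windows mechanism described above can be pictured with a small sketch. Below is a minimal, hypothetical Python illustration, assuming one simple reading of the abstract: a shallow window ensembles recent Q-network outputs, and a deep window's spread supplies the uncertainty used to allocate weights. The class name, window sizes, and the specific inverse-deviation weighting rule are all assumptions made for illustration; the paper's actual architecture operates inside the transformer-based DTQN and is not specified in this record.

```python
from collections import deque

import numpy as np


class WindowsEnsemble:
    """Illustrative weighted ensemble of historical Q-value snapshots.

    NOTE: hypothetical sketch of the shallow/deep window idea from the
    abstract, not the paper's actual Windows DTQN implementation.
    """

    def __init__(self, shallow_size: int = 3, deep_size: int = 10):
        # Shallow window: the recent snapshots that get ensembled.
        self.shallow = deque(maxlen=shallow_size)
        # Deep window: a longer history used only to gauge uncertainty.
        self.deep = deque(maxlen=deep_size)

    def push(self, q_values) -> None:
        """Record the Q-values the current network assigns to one state."""
        q = np.asarray(q_values, dtype=float)
        self.shallow.append(q)
        self.deep.append(q)

    def estimate(self) -> np.ndarray:
        """Combine the shallow window, weighting snapshots by stability."""
        shallow = np.stack(self.shallow)              # (w, n_actions)
        deep_mean = np.stack(self.deep).mean(axis=0)  # long-run estimate
        # Treat a snapshot's distance from the deep-window mean as its
        # uncertainty; more uncertain snapshots receive smaller weights.
        uncertainty = np.linalg.norm(shallow - deep_mean, axis=1) + 1e-8
        weights = 1.0 / uncertainty
        weights /= weights.sum()
        return weights @ shallow                      # variance-damped Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ens = WindowsEnsemble()
    for _ in range(12):  # stand-in for Q-values logged across training
        ens.push(rng.normal(loc=[1.0, 0.5, -0.2], scale=0.3))
    print(ens.estimate())
```

The design intuition mirrored here is only the variance reduction idea: snapshots that stray far from the long-run estimate are down-weighted, damping the Q-value variance that the abstract identifies as destabilizing for the Transformer's inputs.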
Pages: 19