Windows deep transformer Q-networks: an extended variance reduction architecture for partially observable reinforcement learning

Cited: 0
Authors
Wang, Zijian [1,2]
Wang, Bin [1,2]
Dou, Hongbo [1,2]
Liu, Zhongyuan [1,2]
Affiliations
[1] China Univ Petr East China, Qingdao Inst Software, Changjiang West Rd, Qingdao 266580, Shandong, Peoples R China
[2] China Univ Petr East China, Coll Comp Sci & Technol, Changjiang West Rd, Qingdao 266580, Shandong, Peoples R China
Keywords
Deep reinforcement learning; Deep Q-network; Transformer; Variance reduction
DOI
10.1007/s10489-024-05867-3
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Partially Observable Markov Decision Processes (POMDPs) remain worth studying in reinforcement learning (RL) because of their universality in the real world. Unlike in Markov Decision Processes (MDPs), agents in a POMDP cannot fully observe the environment state, which is an obstacle to traditional RL algorithms. One solution is to establish a sequence-to-sequence model. As the core of deep Q-networks, the Transformer has achieved strong results on partial observability problems. Nevertheless, deep Q-networks suffer from overestimation of the Q-value, which destabilizes the quality of the data fed to the Transformer. As deviations accumulate rapidly, model performance may decline drastically, producing severe errors that are fatal to policy learning. In this paper, we note that previous Q-value overestimation mitigation models are not suitable for the Deep Transformer Q-Network (DTQN) framework, because DTQN is a sequence-to-sequence model, not merely a value optimization model in traditional RL. Therefore, we propose Windows DTQN, which reduces Q-value variance via the synergistic effect of shallow and deep windows. In particular, Windows DTQN ensembles the historical Q-networks through the shallow windows, and estimates the uncertainty of the Q-networks through the deep windows for weight allocation. Our experiments on gridverse environments demonstrate that our model achieves better results than current mainstream DQN algorithms in POMDPs. Compared to DTQN, Windows DTQN increases the average success rate by 5.1% and the average return by 1.11.
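The windows mechanism described above can be pictured with a small sketch. Below is a minimal, hypothetical Python illustration, assuming one simple reading of the abstract: a shallow window ensembles recent Q-network outputs, and a deep window's spread supplies the uncertainty used to allocate weights. The class name, window sizes, and the specific inverse-deviation weighting rule are all assumptions made for illustration; the paper's actual architecture operates inside the transformer-based DTQN and is not specified in this record.

```python
from collections import deque

import numpy as np


class WindowsEnsemble:
    """Illustrative weighted ensemble of historical Q-value snapshots.

    NOTE: hypothetical sketch of the shallow/deep window idea from the
    abstract, not the paper's actual Windows DTQN implementation.
    """

    def __init__(self, shallow_size: int = 3, deep_size: int = 10):
        # Shallow window: the recent snapshots that get ensembled.
        self.shallow = deque(maxlen=shallow_size)
        # Deep window: a longer history used only to gauge uncertainty.
        self.deep = deque(maxlen=deep_size)

    def push(self, q_values) -> None:
        """Record the Q-values the current network assigns to one state."""
        q = np.asarray(q_values, dtype=float)
        self.shallow.append(q)
        self.deep.append(q)

    def estimate(self) -> np.ndarray:
        """Combine the shallow window, weighting snapshots by stability."""
        shallow = np.stack(self.shallow)              # (w, n_actions)
        deep_mean = np.stack(self.deep).mean(axis=0)  # long-run estimate
        # Treat a snapshot's distance from the deep-window mean as its
        # uncertainty; more uncertain snapshots receive smaller weights.
        uncertainty = np.linalg.norm(shallow - deep_mean, axis=1) + 1e-8
        weights = 1.0 / uncertainty
        weights /= weights.sum()
        return weights @ shallow                      # variance-damped Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ens = WindowsEnsemble()
    for _ in range(12):  # stand-in for Q-values logged across training
        ens.push(rng.normal(loc=[1.0, 0.5, -0.2], scale=0.3))
    print(ens.estimate())
```

The design intuition mirrored here is only the variance reduction idea: snapshots that stray far from the long-run estimate are down-weighted, damping the Q-value variance that the abstract identifies as destabilizing for the Transformer's inputs.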
Pages: 19