Windows deep transformer Q-networks: an extended variance reduction architecture for partially observable reinforcement learning

Cited by: 0
Authors
Wang, Zijian [1 ,2 ]
Wang, Bin [1 ,2 ]
Dou, Hongbo [1 ,2 ]
Liu, Zhongyuan [1 ,2 ]
Affiliations
[1] China Univ Petr East China, Qingdao Inst Software, Changjiang West Rd, Qingdao 266580, Shandong, Peoples R China
[2] China Univ Petr East China, Coll Comp Sci & Technol, Changjiang West Rd, Qingdao 266580, Shandong, Peoples R China
Keywords
Deep reinforcement learning; Deep Q-network; Transformer; Variance reduction;
DOI
10.1007/s10489-024-05867-3
CLC number
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Partially Observable Markov Decision Process (POMDP) remains worth studying in reinforcement learning (RL) because of its ubiquity in the real world. Compared with Markov Decision Processes (MDPs), agents in a POMDP cannot fully observe the state of the environment, which is an obstacle for traditional RL algorithms. One solution is to build a sequence-to-sequence model. Used as the core of deep Q-networks, the Transformer has achieved strong results on partial observability problems. Nevertheless, deep Q-networks suffer from over-estimation of Q-values, which degrades the quality of the data fed to the Transformer. As this deviation accumulates rapidly, model performance can decline drastically, producing severe errors that are fatal to policy learning. In this paper, we note that previous Q-value overestimation mitigation models are not suitable for the Deep Transformer Q-Network (DTQN) framework, because DTQN is a sequence-to-sequence model rather than merely a value-optimization model in traditional RL. Therefore, we propose Windows DTQN, which reduces Q-value variance through the synergy of shallow and deep windows. In particular, Windows DTQN ensembles historical Q-networks through the shallow window and estimates the uncertainty of the Q-networks through the deep window to allocate their weights. Experiments on gridverse environments demonstrate that our model outperforms the current mainstream DQN algorithms in POMDPs. Compared with DTQN, Windows DTQN increases the average success rate by 5.1% and the average return by 1.11.
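Illustrative sketch (not taken from the paper): the abstract suggests ensembling a short window of recent Q-network snapshots while using a longer window of snapshots to estimate uncertainty and allocate their weights. A minimal PyTorch-style interpretation is shown below, assuming the shallow-window snapshots are weighted inversely to their squared deviation from the deep-window mean; the function name, window sizes, and the exact weighting rule are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def windowed_q_estimate(q_snapshots, obs, shallow_k=3, deep_k=10, eps=1e-6):
    """Combine historical Q-network snapshots to reduce Q-value variance.

    q_snapshots: list of frozen Q-networks, ordered oldest -> newest.
    obs: batch of encoded observation histories, shape (B, feat).
    """
    with torch.no_grad():
        # Deep window: a longer history of snapshots, used here only to gauge
        # uncertainty via the mean of their Q-estimates.
        deep_q = torch.stack([net(obs) for net in q_snapshots[-deep_k:]])       # (Kd, B, A)
        deep_mean = deep_q.mean(dim=0)                                           # (B, A)

        # Shallow window: ensemble the most recent snapshots; weight each one
        # inversely to its squared deviation from the deep-window mean
        # (an assumed proxy for per-snapshot uncertainty).
        shallow_q = torch.stack([net(obs) for net in q_snapshots[-shallow_k:]])  # (Ks, B, A)
        inv_dev = 1.0 / ((shallow_q - deep_mean.unsqueeze(0)) ** 2 + eps)        # (Ks, B, A)
        weights = inv_dev / inv_dev.sum(dim=0, keepdim=True)
        return (weights * shallow_q).sum(dim=0)                                  # (B, A)

# Toy usage: linear "Q-networks" stand in for the Transformer-based DTQN heads.
if __name__ == "__main__":
    snapshots = [nn.Linear(8, 4) for _ in range(12)]   # 12 historical snapshots
    obs = torch.randn(32, 8)                           # batch of encoded histories
    q_values = windowed_q_estimate(snapshots, obs)
    print(q_values.shape)                              # torch.Size([32, 4])
```

The weighting scheme here echoes ensemble-style variance reduction for DQN (e.g. averaged Q-networks); the paper's actual shallow/deep window mechanism may differ.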
Pages: 19