Elastic step DQN: A novel multi-step algorithm to alleviate overestimation in Deep Q-Networks

Cited by: 11
Authors
Ly, Adrian [1 ]
Dazeley, Richard [1 ]
Vamplew, Peter [3 ]
Cruz, Francisco [1 ,2 ,4 ]
Aryal, Sunil [1 ]
Affiliations
[1] Deakin Univ, Geelong, Vic 3220, Australia
[2] UNSW, Sydney, NSW 2052, Australia
[3] Federat Univ Australia, Ballarat, Vic 3350, Australia
[4] Univ Cent Chile, Santiago 8330601, Chile
Keywords
Reinforcement learning; DQN; Multi-step update; Overestimation; Neural network; REINFORCEMENT; TUTORIAL;
DOI
10.1016/j.neucom.2023.127170
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Deep Q-Network algorithm (DQN) was the first reinforcement learning algorithm to use a deep neural network to surpass human-level performance in a number of Atari learning environments. However, divergent and unstable behaviour have been long-standing issues in DQNs. The unstable behaviour is often characterised by overestimation in the Q-values, commonly referred to as the overestimation bias. To address the overestimation bias and the divergent behaviour, a number of heuristic extensions have been proposed. Notably, multi-step updates have been shown to drastically reduce unstable behaviour while improving agents' training performance. However, agents are often highly sensitive to the selection of the multi-step update horizon (n), and our empirical experiments show that a poorly chosen static value for n can in many cases lead to worse performance than single-step DQN. Inspired by the success of n-step DQN and the effects that multi-step updates have on overestimation bias, this paper proposes a new algorithm that we call 'Elastic Step DQN' (ES-DQN) to alleviate overestimation bias in DQNs. ES-DQN dynamically varies the step-size horizon in multi-step updates based on the similarity between states visited. Our empirical evaluation shows that ES-DQN outperforms n-step DQN with a fixed n, Double DQN and Average DQN in several OpenAI Gym environments while at the same time alleviating the overestimation bias.
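The abstract only sketches the mechanism, so the snippet below is a minimal, hedged illustration (in Python) of the two ingredients it names: an n-step bootstrapped target and an update horizon that stretches or shrinks with the similarity of successive states. The Euclidean distance measure, the sim_threshold and n_max parameters, and both function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hedged sketch only: ES-DQN's actual similarity criterion is not specified in the
# abstract; Euclidean distance, sim_threshold and n_max are assumptions made here.

def elastic_step_horizon(states, sim_threshold=0.5, n_max=8):
    """Grow the multi-step horizon while consecutive states remain 'similar'."""
    n = 1
    for t in range(1, min(len(states), n_max)):
        if np.linalg.norm(np.asarray(states[t]) - np.asarray(states[t - 1])) > sim_threshold:
            break  # states diverged, so stop extending the update horizon
        n += 1
    return n

def n_step_target(rewards, bootstrap_q, n, gamma=0.99):
    """Standard n-step return: discounted reward sum plus a bootstrapped tail value."""
    g = sum(gamma ** i * rewards[i] for i in range(n))
    return g + gamma ** n * bootstrap_q

# Toy usage: the horizon stops at 2 because the third state is far from the second.
states = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([2.0, 1.0])]
rewards = [1.0, 0.5, 0.0]
n = elastic_step_horizon(states)
target = n_step_target(rewards, bootstrap_q=3.0, n=n)
```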
Pages: 13