A novel action decision method of deep reinforcement learning based on a neural network and confidence bound

Cited by: 2
Authors
Zhang, Wenhao [1]
Song, Yaqing [1]
Liu, Xiangpeng [1]
Shangguan, Qianqian [1]
An, Kang [1]
Affiliations
[1] Shanghai Normal Univ, Coll Informat Mech & Elect Engn, Shanghai 201418, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Shanghai;
Keywords
UCB; Exploration and exploitation; Deep reinforcement learning; Machine learning;
DOI
10.1007/s10489-023-04695-1
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In deep reinforcement learning, the excessive randomness of the ε-greedy method degrades the agent's training performance. This paper proposes a novel action decision method that replaces ε-greedy and avoids this excessive randomness. First, a confidence bound span fitting model based on a deep neural network is proposed to address the fundamental problem that UCB cannot estimate the confidence bound span of each action in a high-dimensional state space. Then, a confidence bound span balance model based on target values in reverse order is proposed: after each action decision, the parameters of the U network are updated by backpropagation to balance the confidence bound span. Finally, a dynamic exploration-exploitation balance factor α is introduced to balance exploration and exploitation during training. Experiments conducted with the Nature DQN and Double DQN algorithms demonstrate that, under the baseline algorithms and experimental environments of this paper, the proposed method achieves higher performance than the ε-greedy method. The method presented here is significant for applying confidence bounds to complex reinforcement learning problems.
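
To illustrate the idea summarized in the abstract, the following Python sketch (an assumption, not the authors' implementation) shows how a separate "U network" could predict a per-action confidence bound span that is added to a DQN's Q-values, weighted by a balance factor α, to select actions instead of ε-greedy. The names MLP and select_action, the network sizes, the PyTorch dependency, and the toy dimensions are all illustrative choices.

    # Minimal sketch: UCB-style action selection with a learned confidence bound span.
    import torch
    import torch.nn as nn

    class MLP(nn.Module):
        """Two-layer perceptron used here for both the Q network and the U network."""
        def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state)

    def select_action(q_net: MLP, u_net: MLP, state: torch.Tensor, alpha: float) -> int:
        """Pick the action maximising Q(s, a) + alpha * U(s, a).

        q_net : estimates action values, as in a standard DQN.
        u_net : estimates a confidence bound span per action, replacing the
                count-based UCB bonus that is unavailable in a high-dimensional
                state space.
        alpha : exploration-exploitation balance factor; annealing it over
                training shifts the policy from exploration toward exploitation.
        """
        with torch.no_grad():
            q_values = q_net(state)
            u_spans = u_net(state)
            return int(torch.argmax(q_values + alpha * u_spans).item())

    if __name__ == "__main__":
        state_dim, n_actions = 4, 2      # e.g. CartPole-like dimensions
        q_net = MLP(state_dim, n_actions)
        u_net = MLP(state_dim, n_actions)
        alpha = 1.0                      # large early in training, decayed later
        s = torch.zeros(1, state_dim)    # dummy observation
        print(select_action(q_net, u_net, s, alpha))

In the paper's scheme the U network would additionally be updated by backpropagation after each action decision to balance the confidence bound spans; that update rule is specific to the paper and is not reproduced in this sketch.
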
Pages: 21299-21311
Number of pages: 13