An algorithm that excavates suboptimal states and improves Q-learning

Cited by: 0
Authors
Zhu, Canxin [1 ,2 ]
Yang, Jingmin [1 ,2 ]
Zhang, Wenjie [1 ,2 ]
Zheng, Yifeng [1 ,2 ]
Affiliations
[1] Minnan Normal Univ, Sch Comp Sci, Zhangzhou 363000, Peoples R China
[2] Fuzhou Univ, Affiliated Prov Hosp, Fuzhou 363000, Fujian, Peoples R China
Source
ENGINEERING RESEARCH EXPRESS | 2024, Vol. 6, Issue 04
Keywords
reinforcement learning; exploration and exploitation; Markov decision process; suboptimal state;
DOI
10.1088/2631-8695/ad8dae
CLC Number
T [Industrial Technology];
Discipline Code
08;
Abstract
Reinforcement learning is inspired by the trial-and-error method of animal learning: the reward values obtained from the agent's interaction with the environment serve as feedback signals to train the agent. Reinforcement learning has attracted extensive attention in recent years. It is mainly used to solve sequential decision-making problems and has been applied in many areas, such as autonomous driving, games, and robotics. Exploration and exploitation are the main characteristics that distinguish reinforcement learning from other learning methods, and reinforcement learning methods need reward optimization algorithms to better balance the two. To address the problems of unbalanced exploration and a large amount of repeated exploration in the Q-learning algorithm in MDP environments, an algorithm that excavates suboptimal states and improves Q-learning is proposed. It adopts the exploration idea of 'exploring the potential of the second best': it explores the state with the suboptimal state value and calculates the exploration probability according to the distance between the current state and the goal state; the larger the distance, the higher the agent's exploration demand. In addition, only the immediate reward and the maximum action value of the next state are needed to calculate the Q value. Simulation experiments in two different MDP environments, FrozenLake8x8 and CliffWalking, verify that the proposed algorithm obtains the highest average cumulative reward with the least total time consumption.
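The abstract describes the method only at a high level. Below is a minimal sketch, under the assumption of a grid-world MDP, of how distance-weighted, second-best exploration could sit alongside the standard Q-learning update it mentions; the grid encoding, the Manhattan-distance heuristic, and all function names and hyperparameters are illustrative assumptions, not the authors' implementation (the abstract speaks of the suboptimal state value, for which the second-best action value stands in here).

```python
import numpy as np

# Sketch only: distance-weighted, second-best exploration plus the standard
# tabular Q-learning update. Grid encoding, Manhattan distance, and all
# names/hyperparameters are assumptions, not the paper's code.

def manhattan_distance(state, goal, n_cols):
    """Distance between two flattened grid-world states (assumed heuristic)."""
    r1, c1 = divmod(state, n_cols)
    r2, c2 = divmod(goal, n_cols)
    return abs(r1 - r2) + abs(c1 - c2)

def explore_probability(state, goal, n_cols, max_dist):
    """Exploration probability grows with distance to the goal, following the
    abstract's 'the larger the distance, the higher the exploration demand'."""
    return manhattan_distance(state, goal, n_cols) / max_dist

def select_action(Q, state, goal, n_cols, max_dist, rng):
    """With probability p take the second-best action (standing in for
    'exploring the potential of the second best'); otherwise act greedily."""
    p = explore_probability(state, goal, n_cols, max_dist)
    ranked = np.argsort(Q[state])      # action indices, ascending by value
    if rng.random() < p:
        return ranked[-2]              # second-best action value
    return ranked[-1]                  # greedy (best) action

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-learning target: immediate reward plus the discounted maximum
    action value of the next state, as the abstract describes."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```

In a FrozenLake8x8-style grid (64 states, 4 actions), Q would be a 64x4 array, max_dist the Manhattan distance from start to goal (14 for opposite corners of an 8x8 grid), and rng an instance of np.random.default_rng(); the exploration probability then decays toward zero as the agent approaches the goal.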
Pages: 18