A new Q-learning algorithm based on the Metropolis criterion

Cited by: 111
Authors
Guo, MZ [1 ]
Liu, Y
Malec, J
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Engn, Harbin 150001, Peoples R China
[2] Lund Univ, Dept Comp Sci, S-22100 Lund, Sweden
Source
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS | 2004, Vol. 34, No. 5
Keywords
exploitation; exploration; Metropolis criterion; Q-learning; reinforcement learning;
DOI
10.1109/TSMCB.2004.832154
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
The balance between exploration and exploitation is one of the key problems of action selection in Q-learning. Pure exploitation drives the agent quickly toward locally optimal policies, whereas excessive exploration degrades the performance of the Q-learning algorithm even though it may accelerate learning and help avoid locally optimal policies. In this paper, finding the optimal policy in Q-learning is described as searching for the optimal solution of a combinatorial optimization problem. The Metropolis criterion of the simulated annealing algorithm is introduced to balance exploration and exploitation in Q-learning, and a modified Q-learning algorithm based on this criterion, SA-Q-learning, is presented. Experiments show that SA-Q-learning converges more quickly than Q-learning or Boltzmann exploration, and that it does not suffer from the performance degradation caused by excessive exploration.
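The Metropolis-style action selection the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the exact formulation in the paper: the function names, the dictionary-based Q-table, and the learning parameters (`alpha`, `gamma`) are chosen for the example. The idea is to compare a random candidate action against the greedy one and accept the candidate with the Metropolis acceptance probability, so exploration fades as the temperature is annealed toward zero.

```python
import math
import random

def metropolis_action(Q, state, actions, temperature, rng=random):
    """Metropolis-criterion action selection (sketch of the SA-Q-learning idea).

    Pick the greedy action a_p and a random candidate a_r, then accept the
    candidate with probability min(1, exp((Q[s,a_r] - Q[s,a_p]) / t)).
    At high temperature t this explores almost uniformly; as t -> 0 it
    degenerates to pure greedy exploitation.
    """
    a_p = max(actions, key=lambda a: Q[(state, a)])  # exploit: greedy action
    a_r = rng.choice(actions)                        # explore: random candidate
    delta = Q[(state, a_r)] - Q[(state, a_p)]
    if delta >= 0 or rng.random() < math.exp(delta / temperature):
        return a_r
    return a_p

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update (unchanged by the selection rule)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

In a simulated-annealing setting the temperature would be lowered over time, for example `t = lam * t` with `lam` slightly below 1 after each episode, which is what lets the agent explore early and exploit late.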
Pages: 2140-2143
Number of pages: 4
References
14 records in total
[1]   Cooperative behavior acquisition for mobile robots in dynamically changing real worlds via vision-based reinforcement learning and development [J].
Asada, M ;
Uchibe, E ;
Hosoda, K .
ARTIFICIAL INTELLIGENCE, 1999, 110 (02) :275-292
[2]   Nature's way of optimizing [J].
Boettcher, S ;
Percus, A .
ARTIFICIAL INTELLIGENCE, 2000, 119 (1-2) :275-286
[3]  
Watkins, C.J.C.H., 1989, Learning from Delayed Rewards (PhD thesis, University of Cambridge)
[4]   Explanation-based learning and reinforcement learning: A unified view [J].
Dietterich, TG ;
Flann, NS .
MACHINE LEARNING, 1997, 28 (2-3) :169-210
[5]  
Guo, M., 1998, Computer Science, V25, P13
[6]   Reinforcement learning: A survey [J].
Kaelbling, LP ;
Littman, ML ;
Moore, AW .
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 1996, 4 :237-285
[7]   Optimization by simulated annealing [J].
Kirkpatrick, S ;
Gelatt, CD ;
Vecchi, MP .
SCIENCE, 1983, 220 (4598) :671-680
[8]   Equation of state calculations by fast computing machines [J].
Metropolis, N ;
Rosenbluth, AW ;
Rosenbluth, MN ;
Teller, AH ;
Teller, E .
JOURNAL OF CHEMICAL PHYSICS, 1953, 21 (06) :1087-1092
[9]  
Mitchell, T., 1997, Machine Learning
[10]  
Sutton R. S., 1988, Machine Learning, V3, P9, DOI 10.1023/A:1022633531479