A new Q-learning algorithm based on the Metropolis criterion

Cited by: 111
Authors
Guo, MZ [1 ]
Liu, Y
Malec, J
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Engn, Harbin 150001, Peoples R China
[2] Lund Univ, Dept Comp Sci, S-22100 Lund, Sweden
Source
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS | 2004, Vol. 34, No. 5
Keywords
exploitation; exploration; Metropolis criterion; Q-learning; reinforcement learning;
DOI
10.1109/TSMCB.2004.832154
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
The balance between exploration and exploitation is one of the key problems of action selection in Q-learning. Pure exploitation drives the agent quickly toward locally optimal policies, whereas excessive exploration degrades the performance of the Q-learning algorithm even though it may accelerate learning and help avoid locally optimal policies. In this paper, finding the optimal policy in Q-learning is described as searching for the optimal solution of a combinatorial optimization problem. The Metropolis criterion of the simulated annealing algorithm is introduced to balance exploration and exploitation in Q-learning, and a modified Q-learning algorithm based on this criterion, SA-Q-learning, is presented. Experiments show that SA-Q-learning converges more quickly than Q-learning or Boltzmann exploration, and that it does not suffer from the performance degradation caused by excessive exploration.
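The Metropolis-style action selection the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the exact formulation in the paper: the function names, the dictionary-based Q-table, and the learning parameters (`alpha`, `gamma`) are chosen for the example. The idea is to compare a random candidate action against the greedy one and accept the candidate with the Metropolis acceptance probability, so exploration fades as the temperature is annealed toward zero.

```python
import math
import random

def metropolis_action(Q, state, actions, temperature, rng=random):
    """Metropolis-criterion action selection (sketch of the SA-Q-learning idea).

    Pick the greedy action a_p and a random candidate a_r, then accept the
    candidate with probability min(1, exp((Q[s,a_r] - Q[s,a_p]) / t)).
    At high temperature t this explores almost uniformly; as t -> 0 it
    degenerates to pure greedy exploitation.
    """
    a_p = max(actions, key=lambda a: Q[(state, a)])  # exploit: greedy action
    a_r = rng.choice(actions)                        # explore: random candidate
    delta = Q[(state, a_r)] - Q[(state, a_p)]
    if delta >= 0 or rng.random() < math.exp(delta / temperature):
        return a_r
    return a_p

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update (unchanged by the selection rule)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

In a simulated-annealing setting the temperature would be lowered over time, for example `t = lam * t` with `lam` slightly below 1 after each episode, which is what lets the agent explore early and exploit late.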
Pages: 2140-2143
Number of pages: 4
References
14 records in total
[1]   Cooperative behavior acquisition for mobile robots in dynamically changing real worlds via vision-based reinforcement learning and development [J].
Asada, M ;
Uchibe, E ;
Hosoda, K .
ARTIFICIAL INTELLIGENCE, 1999, 110 (02) :275-292
[2]   Nature's way of optimizing [J].
Boettcher, S ;
Percus, A .
ARTIFICIAL INTELLIGENCE, 2000, 119 (1-2) :275-286
[3]  
Watkins, C.J.C.H., 1989, Learning from Delayed Rewards (PhD thesis, University of Cambridge)
[4]   Explanation-based learning and reinforcement learning: A unified view [J].
Dietterich, TG ;
Flann, NS .
MACHINE LEARNING, 1997, 28 (2-3) :169-210
[5]  
Guo, M., 1998, Computer Science, V25, P13
[6]   Reinforcement learning: A survey [J].
Kaelbling, LP ;
Littman, ML ;
Moore, AW .
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 1996, 4 :237-285
[7]   Optimization by simulated annealing [J].
Kirkpatrick, S ;
Gelatt, CD ;
Vecchi, MP .
SCIENCE, 1983, 220 (4598) :671-680
[8]   Equation of state calculations by fast computing machines [J].
Metropolis, N ;
Rosenbluth, AW ;
Rosenbluth, MN ;
Teller, AH ;
Teller, E .
JOURNAL OF CHEMICAL PHYSICS, 1953, 21 (06) :1087-1092
[9]  
Mitchell, T., 1997, Machine Learning
[10]  
Sutton R. S., 1988, Machine Learning, V3, P9, DOI 10.1023/A:1022633531479