Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems

Cited by: 52
Authors
Koulouriotis, D. E. [1 ]
Xanthopoulos, A. [1 ]
Affiliations
[1] Democritus Univ Thrace, Sch Engn, Dept Prod & Management Engn, Dragana, Greece
Keywords
decision-making agents; action selection; exploration-exploitation; multi-armed bandit; genetic algorithms; reinforcement learning
DOI
10.1016/j.amc.2007.07.043
Chinese Library Classification (CLC)
O29 [Applied Mathematics]
Discipline code
070104
Abstract
Multi-armed bandit tasks have been used extensively to model the problem of balancing exploitation and exploration. One of the most challenging variants of the MABP is the non-stationary bandit problem, in which the agent faces the added complexity of detecting changes in its environment. In this paper we examine a non-stationary, discrete-time, finite-horizon bandit problem with a finite number of arms and Gaussian rewards. A family of important ad hoc methods exists that is suitable for non-stationary bandit tasks. These learning algorithms, which offer intuition-based solutions to the exploitation-exploration trade-off, have the advantage of not relying on strong theoretical assumptions while at the same time being tunable to produce near-optimal results. An entirely different approach to the non-stationary multi-armed bandit problem presents itself in the form of evolutionary algorithms. We present an evolutionary algorithm implemented to solve the non-stationary bandit problem, alongside ad hoc solution algorithms, namely action-value methods with ε-greedy and softmax action selection rules, the probability matching method, and the adaptive pursuit method. A number of simulation-based experiments were conducted, and based on the numerical results obtained we discuss the methods' performance. (C) 2007 Elsevier Inc. All rights reserved.
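The abstract names several ad hoc learning algorithms; as a rough illustration of the simplest of them, the sketch below implements an ε-greedy action-value method with a constant step size on a non-stationary Gaussian bandit whose arm means follow a random walk. This is not the paper's code: the arm count, horizon, EPSILON, ALPHA, and DRIFT values are all illustrative assumptions.

```python
import random

# Minimal sketch (not the authors' implementation) of an epsilon-greedy
# action-value method on a non-stationary Gaussian bandit. All constants
# below are illustrative assumptions, not values from the paper.

K = 5             # number of arms (assumption)
HORIZON = 1000    # finite horizon (assumption)
EPSILON = 0.1     # exploration probability (assumption)
ALPHA = 0.1       # constant step size: weights recent rewards more heavily
DRIFT = 0.05      # std. dev. of each arm's random-walk drift (assumption)

random.seed(0)
true_means = [random.gauss(0.0, 1.0) for _ in range(K)]
estimates = [0.0] * K

total_reward = 0.0
for t in range(HORIZON):
    # epsilon-greedy action selection: explore with probability EPSILON,
    # otherwise pick the arm with the highest current value estimate
    if random.random() < EPSILON:
        arm = random.randrange(K)
    else:
        arm = max(range(K), key=lambda a: estimates[a])

    # Gaussian reward centred on the arm's current (drifting) true mean
    reward = random.gauss(true_means[arm], 1.0)
    total_reward += reward

    # constant-step-size update, suited to tracking a moving target
    estimates[arm] += ALPHA * (reward - estimates[arm])

    # non-stationarity: every true mean takes a small random-walk step
    for a in range(K):
        true_means[a] += random.gauss(0.0, DRIFT)

print(f"average reward over {HORIZON} steps: {total_reward / HORIZON:.3f}")
```

The constant step size ALPHA (rather than a 1/n sample average) is what lets the estimates track drifting means; this is the standard adaptation of action-value methods to non-stationary tasks.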
Pages: 913-922 (10 pages)