Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems

Cited by: 52
Authors
Koulouriotis, D. E. [1 ]
Xanthopoulos, A. [1 ]
Affiliations
[1] Democritus Univ Thrace, Sch Engn, Dept Prod & Management Engn, Dragana, Greece
Keywords
decision-making agents; action selection; exploration-exploitation; multi-armed bandit; genetic algorithms; reinforcement learning
DOI
10.1016/j.amc.2007.07.043
Chinese Library Classification (CLC)
O29 [Applied Mathematics]
Discipline code
070104
Abstract
Multi-armed bandit tasks have been used extensively to model the problem of balancing exploitation and exploration. One of the most challenging variants of the MABP is the non-stationary bandit problem, in which the agent faces the added complexity of detecting changes in its environment. In this paper we examine a non-stationary, discrete-time, finite-horizon bandit problem with a finite number of arms and Gaussian rewards. A family of important ad hoc methods exists that is suitable for non-stationary bandit tasks. These learning algorithms, which offer intuition-based solutions to the exploitation-exploration trade-off, have the advantage of not relying on strong theoretical assumptions while at the same time being tunable to produce near-optimal results. An entirely different approach to the non-stationary multi-armed bandit problem presents itself in the form of evolutionary algorithms. We present an evolutionary algorithm implemented to solve the non-stationary bandit problem, alongside ad hoc solution algorithms, namely action-value methods with ε-greedy and softmax action selection rules, the probability matching method, and the adaptive pursuit method. A number of simulation-based experiments were conducted, and based on the numerical results obtained we discuss the methods' performance. (C) 2007 Elsevier Inc. All rights reserved.
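The abstract names several ad hoc learning algorithms; as a rough illustration of the simplest of them, the sketch below implements an ε-greedy action-value method with a constant step size on a non-stationary Gaussian bandit whose arm means follow a random walk. This is not the paper's code: the arm count, horizon, EPSILON, ALPHA, and DRIFT values are all illustrative assumptions.

```python
import random

# Minimal sketch (not the authors' implementation) of an epsilon-greedy
# action-value method on a non-stationary Gaussian bandit. All constants
# below are illustrative assumptions, not values from the paper.

K = 5             # number of arms (assumption)
HORIZON = 1000    # finite horizon (assumption)
EPSILON = 0.1     # exploration probability (assumption)
ALPHA = 0.1       # constant step size: weights recent rewards more heavily
DRIFT = 0.05      # std. dev. of each arm's random-walk drift (assumption)

random.seed(0)
true_means = [random.gauss(0.0, 1.0) for _ in range(K)]
estimates = [0.0] * K

total_reward = 0.0
for t in range(HORIZON):
    # epsilon-greedy action selection: explore with probability EPSILON,
    # otherwise pick the arm with the highest current value estimate
    if random.random() < EPSILON:
        arm = random.randrange(K)
    else:
        arm = max(range(K), key=lambda a: estimates[a])

    # Gaussian reward centred on the arm's current (drifting) true mean
    reward = random.gauss(true_means[arm], 1.0)
    total_reward += reward

    # constant-step-size update, suited to tracking a moving target
    estimates[arm] += ALPHA * (reward - estimates[arm])

    # non-stationarity: every true mean takes a small random-walk step
    for a in range(K):
        true_means[a] += random.gauss(0.0, DRIFT)

print(f"average reward over {HORIZON} steps: {total_reward / HORIZON:.3f}")
```

The constant step size ALPHA (rather than a 1/n sample average) is what lets the estimates track drifting means; this is the standard adaptation of action-value methods to non-stationary tasks.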
Pages: 913-922 (10 pages)