Thompson Sampling Guided Stochastic Searching on the Line for Non-stationary Adversarial Learning

被引：1

作者：

Glimsdal, Sondre ^{[1
]}

Granmo, Ole-Christoffer ^{[1
]}

机构：

[1] Univ Agder, Dept ICT, Kristiansand, Norway

来源：

2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA) | 2015年

关键词：

D O I：

10.1109/ICMLA.2015.203

中图分类号：

TP3 [计算技术、计算机技术];

学科分类号：

0812 ;

摘要：

This paper reports the first known solution to the N-Door puzzle when the environment is both non-stationary and deceptive (adversarial learning). The Multi-Armed-Bandit (MAB) problem is the iconic representation of the exploration versus exploitation dilemma. In brief, a gambler repeatedly selects and play, one out of N possible slot machines or arms and either receives a reward or a penalty. The objective of the gambler is then to locate the most rewarding arm to play, while in the process maximize his winnings. In this paper we investigate a challenging variant of the MAB problem, namely the non-stationary N-Door puzzle. Here, instead of directly observing the reward, the gambler is only told whether the optimal arm lies to the "left" or to the "right" of the selected arm, with the feedback being erroneous with probability 1 - pi. However, due to the non-stationary property the optimal arm can abruptly and without notice switch place with a previous sub-optimal arm. To further complicate the situation, we do not assume that the environment is informative, that is, we allow for a traitorous environment that on-average guide the gambler in the opposite direction of the optimal arm (adversarial learning problem). This coupled with the non-stationary property makes for a highly demanding reinforcement learning problem. The novel scheme presented in this paper enhance the previous top contender for the stationary N-door problem with the capability to detect and adapt to a changing environment. The resulting scheme TS-NSPL is then empirically proved to be superior to the existing state-of-art.

引用

页码：687 / 692

页数：6

共 19 条

[1] Atan O., 2015, ARXIV150308370
[2] Baeza-Yates R.A., 1988, SEARCHING UNCERTAINT
[3] Chapelle O, 2011, Adv. Neural Inform. Processing Systems, P2249
[4] Glimsdal S., 2015, THOMPSON SA IN PRESS
[5] Learning automata-based solutions to the nonlinear fractional knapsack problem with applications to optimal resource allocation
Granmo, Ole-Christoffer
Oommen, B. John
Myrer, Svein Arild
Olsen, Morten Goodwin
[J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS, 2007, 37 (01): : 166 - 175
[6] Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton
Granmo, Ole-Christoffer
[J]. INTERNATIONAL JOURNAL OF INTELLIGENT COMPUTING AND CYBERNETICS, 2010, 3 (02) : 207 - 234
[7] Lattimore T., 2015, ARXIV150707880
[8] A solution to the stochastic point location problem in metalevel nonstationary environments
Oommen, B. John
Kim, Sang-Woon
Samuel, Mathew T.
Granmo, Ole-Christoffer
[J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS, 2008, 38 (02): : 466 - 476
[9] Routing bandwidth-guaranteed paths in MPLS traffic engineering: A multiple race track learning approach
Oommen, B. John
Misra, Sudip
Granmo, Ole-Christoffer
[J]. IEEE TRANSACTIONS ON COMPUTERS, 2007, 56 (07) : 959 - 976
[10] Oommen BJ, 2009, STUD COMPUT INTELL, V214, P317

← 1 2 →