Thompson Sampling Guided Stochastic Searching on the Line for Non-stationary Adversarial Learning

被引:1
作者
Glimsdal, Sondre [1 ]
Granmo, Ole-Christoffer [1 ]
机构
[1] Univ Agder, Dept ICT, Kristiansand, Norway
来源
2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA) | 2015年
关键词
D O I
10.1109/ICMLA.2015.203
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
This paper reports the first known solution to the N-Door puzzle when the environment is both non-stationary and deceptive (adversarial learning). The Multi-Armed-Bandit (MAB) problem is the iconic representation of the exploration versus exploitation dilemma. In brief, a gambler repeatedly selects and play, one out of N possible slot machines or arms and either receives a reward or a penalty. The objective of the gambler is then to locate the most rewarding arm to play, while in the process maximize his winnings. In this paper we investigate a challenging variant of the MAB problem, namely the non-stationary N-Door puzzle. Here, instead of directly observing the reward, the gambler is only told whether the optimal arm lies to the "left" or to the "right" of the selected arm, with the feedback being erroneous with probability 1 - pi. However, due to the non-stationary property the optimal arm can abruptly and without notice switch place with a previous sub-optimal arm. To further complicate the situation, we do not assume that the environment is informative, that is, we allow for a traitorous environment that on-average guide the gambler in the opposite direction of the optimal arm (adversarial learning problem). This coupled with the non-stationary property makes for a highly demanding reinforcement learning problem. The novel scheme presented in this paper enhance the previous top contender for the stationary N-door problem with the capability to detect and adapt to a changing environment. The resulting scheme TS-NSPL is then empirically proved to be superior to the existing state-of-art.
引用
收藏
页码:687 / 692
页数:6
相关论文
共 19 条
  • [1] Atan O., 2015, ARXIV150308370
  • [2] Baeza-Yates R.A., 1988, SEARCHING UNCERTAINT
  • [3] Chapelle O, 2011, Adv. Neural Inform. Processing Systems, P2249
  • [4] Glimsdal S., 2015, THOMPSON SA IN PRESS
  • [5] Learning automata-based solutions to the nonlinear fractional knapsack problem with applications to optimal resource allocation
    Granmo, Ole-Christoffer
    Oommen, B. John
    Myrer, Svein Arild
    Olsen, Morten Goodwin
    [J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS, 2007, 37 (01): : 166 - 175
  • [6] Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton
    Granmo, Ole-Christoffer
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT COMPUTING AND CYBERNETICS, 2010, 3 (02) : 207 - 234
  • [7] Lattimore T., 2015, ARXIV150707880
  • [8] A solution to the stochastic point location problem in metalevel nonstationary environments
    Oommen, B. John
    Kim, Sang-Woon
    Samuel, Mathew T.
    Granmo, Ole-Christoffer
    [J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS, 2008, 38 (02): : 466 - 476
  • [9] Routing bandwidth-guaranteed paths in MPLS traffic engineering: A multiple race track learning approach
    Oommen, B. John
    Misra, Sudip
    Granmo, Ole-Christoffer
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2007, 56 (07) : 959 - 976
  • [10] Oommen BJ, 2009, STUD COMPUT INTELL, V214, P317