MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT

被引:15
作者
Cowan, Wesley [1 ]
Katehakis, Michael N. [2 ]
机构
[1] Rutgers State Univ, Dept Math, Piscataway, NJ 08854 USA
[2] Rutgers Business Sch, Dept Management Sci & Informat Syst, Piscataway, NJ 08854 USA
基金
美国国家科学基金会;
关键词
OPTIMAL ADAPTIVE POLICIES; GITTINS INDEX; DISCRETE;
D O I
10.1017/S0269964814000217
中图分类号
T [工业技术];
学科分类号
08 ;
摘要
Generally, the multi-armed has been studied under the setting that at each time step over an infinite horizon a controller chooses to activate a single process or bandit out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process, and in doing so advancing the chosen process. Classically, rewards are discounted by a constant factor beta is an element of (0, 1) per round. In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable state space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit may be not be fixed or uniform, subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as "decisions over state space" but rather "decisions over time".
引用
收藏
页码:51 / 76
页数:26
相关论文
共 52 条
[1]   PROPERTIES OF THE GITTINS INDEX WITH APPLICATION TO OPTIMAL SCHEDULING [J].
Aalto, Samuli ;
Ayesta, Urtzi ;
Righter, Rhonda .
PROBABILITY IN THE ENGINEERING AND INFORMATIONAL SCIENCES, 2011, 25 (03) :269-288
[2]   Multi-robot perimeter patrol in adversarial settings [J].
Agmon, Noa ;
Kraus, Sarit ;
Kaminka, Gal A. .
2008 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, VOLS 1-9, 2008, :2339-2345
[3]  
Agrawal R., 1990, Stochastics and Stochastics Reports, V29, P437, DOI 10.1080/17442509008833627
[4]  
[Anonymous], 2012, ARXIV12045721
[5]  
[Anonymous], 1986, LECT NOTES MONOGRAPH
[6]  
[Anonymous], 1989, Multi-armed Bandit Allocation Indices
[7]  
[Anonymous], 1982, Grundlehren der Mathematischen Wissenschaften
[8]  
Auer Peter, 2007, Advances in Neural Information Processing Systems, P49
[9]  
Bertsekas D.P., 2011, Dynamic Programming and Optimal Control, Vii
[10]  
Burnetas A.N., 2002, PROBAB ENG INFORM SC, V17, P157