Robust control of the multi-armed bandit problem

Cited by: 3
Authors
Caro, Felipe [1 ]
Das Gupta, Aparupa [1 ]
Affiliations
[1] Univ Calif Los Angeles, Anderson Sch Management, Los Angeles, CA 90095 USA
Keywords
Multiarmed bandit; Index policies; Bellman equation; Robust Markov decision processes; Uncertain transition matrix; Project selection; MARKOV DECISION-PROCESSES; OPTIMAL ADAPTIVE POLICIES; ALLOCATION;
DOI
10.1007/s10479-015-1965-7
Chinese Library Classification (CLC)
C93 [Management]; O22 [Operations Research];
Discipline codes
070105; 12; 1201; 1202; 120202;
Abstract
We study a robust model of the multi-armed bandit (MAB) problem in which the transition probabilities are ambiguous and belong to subsets of the probability simplex. We first show that for each arm there exists a robust counterpart of the Gittins index that is the solution to a robust optimal stopping-time problem and can be computed effectively with an equivalent restart problem. We then characterize the optimal policy of the robust MAB as a project-by-project retirement policy, but we show that the arms become dependent, so the policy based on the robust Gittins index is not optimal. For a project selection problem, we show that the robust Gittins index policy is near optimal, but its implementation requires more computational effort than solving a non-robust MAB problem. Hence, we propose a Lagrangian index policy that requires the same computational effort as evaluating the indices of a non-robust MAB and is within 1% of the optimum in the robust project selection problem.
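The restart-problem computation mentioned in the abstract can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' code: it assumes a finite ambiguity set of candidate transition rows per state (the paper allows general subsets of the probability simplex) and computes the robust index of a state `s0` via a Katehakis-Veinott-style restart-in-state value iteration in which continuation values are evaluated under the worst-case transition row. The function name `robust_gittins_index` and its signature are assumptions for this sketch.

```python
import numpy as np

def robust_gittins_index(rewards, P_sets, s0, beta=0.9, tol=1e-8, max_iter=10_000):
    """Sketch of a robust Gittins index for state s0 of a single arm.

    rewards : array of per-state rewards r(s).
    P_sets  : P_sets[s] is a list of candidate transition rows (the
              ambiguity set for state s); a singleton list recovers the
              non-robust index.
    Uses the restart-in-state fixed point
        V(s) = max{ r(s0) + beta * w(s0),  r(s) + beta * w(s) },
    where w(s) is the worst-case expected continuation value over the
    ambiguity set, and returns (1 - beta) * V(s0).
    """
    n = len(rewards)
    V = np.zeros(n)
    for _ in range(max_iter):
        # Worst-case continuation value: nature picks the least favorable row.
        worst = np.array([min(p @ V for p in P_sets[s]) for s in range(n)])
        restart = rewards[s0] + beta * worst[s0]        # value of restarting in s0
        V_new = np.maximum(restart, rewards + beta * worst)  # continue vs. restart
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[s0]

# Tiny demo: a two-state arm with ambiguity only in state 0's transitions.
rewards = np.array([1.0, 0.0])
P_sets = [[np.array([0.9, 0.1]), np.array([0.1, 0.9])],  # state 0: two candidate rows
          [np.array([0.5, 0.5])]]                         # state 1: singleton
idx = robust_gittins_index(rewards, P_sets, s0=0)
```

Because the worst-case minimum is taken over a larger set, enlarging the ambiguity set can only lower the computed index, consistent with the robust index being a pessimistic counterpart of the nominal Gittins index.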
Pages: 461-480
Page count: 20