Robust control of the multi-armed bandit problem

Cited by: 3
Authors
Caro, Felipe [1 ]
Das Gupta, Aparupa [1 ]
Affiliations
[1] Univ Calif Los Angeles, Anderson Sch Management, Los Angeles, CA 90095 USA
Keywords
Multiarmed bandit; Index policies; Bellman equation; Robust Markov decision processes; Uncertain transition matrix; Project selection; MARKOV DECISION-PROCESSES; OPTIMAL ADAPTIVE POLICIES; ALLOCATION;
DOI
10.1007/s10479-015-1965-7
Chinese Library Classification (CLC)
C93 [Management]; O22 [Operations Research];
Discipline codes
070105; 12; 1201; 1202; 120202;
Abstract
We study a robust model of the multi-armed bandit (MAB) problem in which the transition probabilities are ambiguous and belong to subsets of the probability simplex. We first show that for each arm there exists a robust counterpart of the Gittins index that is the solution to a robust optimal stopping-time problem and can be computed effectively with an equivalent restart problem. We then characterize the optimal policy of the robust MAB as a project-by-project retirement policy, but we show that the arms become dependent, so the policy based on the robust Gittins index is not optimal. For a project selection problem, we show that the robust Gittins index policy is near optimal, but its implementation requires more computational effort than solving a non-robust MAB problem. Hence, we propose a Lagrangian index policy that requires the same computational effort as evaluating the indices of a non-robust MAB and is within 1% of the optimum in the robust project selection problem.
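The restart-problem computation mentioned in the abstract can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' code: it assumes a finite ambiguity set of candidate transition rows per state (the paper allows general subsets of the probability simplex) and computes the robust index of a state `s0` via a Katehakis-Veinott-style restart-in-state value iteration in which continuation values are evaluated under the worst-case transition row. The function name `robust_gittins_index` and its signature are assumptions for this sketch.

```python
import numpy as np

def robust_gittins_index(rewards, P_sets, s0, beta=0.9, tol=1e-8, max_iter=10_000):
    """Sketch of a robust Gittins index for state s0 of a single arm.

    rewards : array of per-state rewards r(s).
    P_sets  : P_sets[s] is a list of candidate transition rows (the
              ambiguity set for state s); a singleton list recovers the
              non-robust index.
    Uses the restart-in-state fixed point
        V(s) = max{ r(s0) + beta * w(s0),  r(s) + beta * w(s) },
    where w(s) is the worst-case expected continuation value over the
    ambiguity set, and returns (1 - beta) * V(s0).
    """
    n = len(rewards)
    V = np.zeros(n)
    for _ in range(max_iter):
        # Worst-case continuation value: nature picks the least favorable row.
        worst = np.array([min(p @ V for p in P_sets[s]) for s in range(n)])
        restart = rewards[s0] + beta * worst[s0]        # value of restarting in s0
        V_new = np.maximum(restart, rewards + beta * worst)  # continue vs. restart
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[s0]

# Tiny demo: a two-state arm with ambiguity only in state 0's transitions.
rewards = np.array([1.0, 0.0])
P_sets = [[np.array([0.9, 0.1]), np.array([0.1, 0.9])],  # state 0: two candidate rows
          [np.array([0.5, 0.5])]]                         # state 1: singleton
idx = robust_gittins_index(rewards, P_sets, s0=0)
```

Because the worst-case minimum is taken over a larger set, enlarging the ambiguity set can only lower the computed index, consistent with the robust index being a pessimistic counterpart of the nominal Gittins index.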
Pages: 461-480
Page count: 20