Percentile optimization in multi-armed bandit problems

Cited by: 1

Authors:

Ghatrani, Zahra [1]

Ghate, Archis [1]

Affiliations:

[1] Univ Washington, Ind & Syst Engn, Seattle, WA 98195 USA

Source:

ANNALS OF OPERATIONS RESEARCH | 2024 / Vol. 340 / Issue 2-3

Keywords:

Dynamic programming; Lagrangian relaxation; Chance-constrained programming

DOI:

10.1007/s10479-024-06165-4

Chinese Library Classification (CLC) codes:

C93 [Management science]; O22 [Operations research]

Discipline classification codes:

070105; 12; 1201; 1202; 120202

Abstract:

A multi-armed bandit (MAB) problem is described as follows. At each time-step, a decision-maker selects one arm from a finite set. A reward is earned from this arm, and the state of that arm evolves stochastically. The goal is to determine an arm-pulling policy that maximizes the expected total discounted reward over an infinite horizon. We study MAB problems where the rewards are multivariate Gaussian, to account for data-driven estimation errors. We employ a percentile optimization approach, wherein the goal is to find an arm-pulling policy that maximizes the sum of percentiles of expected total discounted rewards earned from individual arms. The idea is motivated by recent work on percentile optimization in Markov decision processes. We demonstrate that, when applied to MABs, this yields an intractable second-order cone program (SOCP) whose size is exponential in the number of arms. We use Lagrangian relaxation to break the resulting curse of dimensionality. Specifically, we show that the relaxed problem can be reformulated as an SOCP whose size is linear in the number of arms. We propose three approaches to recover feasible arm-pulling decisions at run-time from an off-line optimal solution of this SOCP. Our numerical experiments suggest that one of these three methods is more effective than the other two.
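The abstract does not spell out why Gaussian rewards lead to a second-order cone program. The sketch below is not the authors' formulation; it only illustrates the standard fact behind Gaussian chance constraints, under the assumption that a reward vector r follows N(mu, cov) and that x is some fixed, hypothetical weight vector. The percentile y with P(r·x >= y) = 1 - eps then equals mu·x - Φ⁻¹(1 - eps) · sqrt(xᵀ cov x), and the square-root term is what becomes a second-order cone constraint when x is a decision variable. The function name `gaussian_percentile` and all numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def gaussian_percentile(x, mu, cov, eps):
    """Return y such that P(r.x >= y) = 1 - eps when r ~ N(mu, cov).

    x, mu: lists of floats; cov: covariance matrix as a list of lists.
    Illustrative helper, not taken from the paper.
    """
    n = len(x)
    mean = sum(x[i] * mu[i] for i in range(n))
    var = sum(x[i] * cov[i][j] * x[j] for i in range(n) for j in range(n))
    z = NormalDist().inv_cdf(1.0 - eps)  # standard normal quantile
    return mean - z * sqrt(var)


# Hypothetical two-dimensional example with independent unit-variance rewards.
x = [0.5, 0.5]
mu = [1.0, 2.0]
cov = [[1.0, 0.0], [0.0, 1.0]]

y_median = gaussian_percentile(x, mu, cov, 0.5)  # eps = 0.5 recovers the mean
y_10 = gaussian_percentile(x, mu, cov, 0.1)      # a more conservative percentile
```

With eps = 0.5 the quantile term vanishes and the percentile coincides with the expected value, so the risk-neutral objective is recovered; smaller eps penalizes weight vectors with high reward variance.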

Pages: 837-862

Number of pages: 26

References (18 in total):

[1] Relaxations of weakly coupled stochastic dynamic programs [J]. Adelman, Daniel; Mersereau, Adam J. OPERATIONS RESEARCH, 2008, 56 (03): 712-727

[2] Second-order cone programming [J]. Alizadeh, F; Goldfarb, D. MATHEMATICAL PROGRAMMING, 2003, 95 (01): 3-51

[3] Bertsekas D., 1999, Nonlinear Programming, 2nd ed.

[4] Bertsekas D., 2000, Dynamic programming and optimal control, 2nd ed.

[5] Caro F., 2015, ANN OPER RES, P1

[6] Chance-constrained programming [J]. Charnes, A; Cooper, WW. MANAGEMENT SCIENCE, 1959, 6 (01): 73-79

[7] Percentile Optimization for Markov Decision Processes with Parameter Uncertainty [J]. Delage, Erick; Mannor, Shie. OPERATIONS RESEARCH, 2010, 58 (01): 203-213

[8] Ghatrani Z., 2021, THESIS U WASHINGTON

[9] GITTINS JC, 1979, J ROY STAT SOC B MET, V41, P148

[10] Hawkins J. T., 2003, Ph.D. thesis
