Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cited by: 3
Authors
Ma, Cong [1 ]
Zhu, Banghua [2 ]
Jiao, Jiantao [2 ,3 ]
Wainwright, Martin J. [2 ,3 ]
Affiliations
[1] University of Chicago, Department of Statistics, Chicago, IL 60637 USA
[2] University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, Berkeley, CA 94720 USA
[3] University of California, Berkeley, Department of Statistics, Berkeley, CA 94720 USA
Keywords
Switches; Probability; Monte Carlo methods; Chebyshev approximation; Measurement; Computational modeling; Sociology; Off-policy evaluation; multi-armed bandits; minimax optimality; importance sampling; Polynomials
DOI
10.1109/TIT.2022.3162335
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger than that of the oracle estimator equipped with knowledge of the behavior policy by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial-knowledge setting, in which the minimum probability assigned by the behavior policy is assumed to be known. We show that the plug-in estimator is optimal for relatively large values of this minimum probability, but is sub-optimal when the minimum probability is small. To remedy this gap, we propose a new estimator, based on approximation by Chebyshev polynomials, that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
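For concreteness, consider K arms with mean rewards mu(a), a target policy pi_t, and n logged samples (a_i, r_i) collected under a known behavior policy pi_b; the goal is to estimate the value V(pi_t) = sum_a pi_t(a) * mu(a). The importance sampling estimator averages the reweighted rewards (pi_t(a_i)/pi_b(a_i)) * r_i, while the plug-in estimator combines per-arm empirical mean rewards as sum_a pi_t(a) * mu_hat(a). Below is a minimal NumPy sketch of a Switch-style estimator that applies importance sampling on arms whose importance ratio is at most a threshold tau and the plug-in on the remaining arms; the function name, the threshold parameter, and the hard per-arm split are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def switch_estimator(actions, rewards, pi_b, pi_t, tau):
    """Switch-style off-policy value estimate (illustrative sketch, not
    the paper's exact estimator).

    actions, rewards : length-n logged data collected under pi_b
    pi_b, pi_t       : length-K probability vectors (behavior / target)
    tau              : threshold on the importance ratio pi_t / pi_b
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    ratio = pi_t / pi_b                   # per-arm importance weights
    use_is = ratio <= tau                 # arms handled by importance sampling
    # Importance-sampling part: average of reweighted rewards on low-ratio arms.
    is_part = np.mean(np.where(use_is[actions], ratio[actions] * rewards, 0.0))
    # Plug-in part: empirical mean reward of each high-ratio arm, weighted by pi_t.
    plug_part = 0.0
    for a in np.flatnonzero(~use_is):
        pulls = actions == a
        if pulls.any():
            plug_part += pi_t[a] * rewards[pulls].mean()
    return is_part + plug_part

# Quick check on simulated Bernoulli rewards: the estimate should be
# close to the true value sum_a pi_t[a] * mu[a].
rng = np.random.default_rng(0)
K, n = 5, 10_000
mu = rng.uniform(size=K)                  # true mean rewards
pi_b = np.full(K, 1.0 / K)                # uniform behavior policy
pi_t = rng.dirichlet(np.ones(K))          # random target policy
actions = rng.choice(K, size=n, p=pi_b)
rewards = rng.binomial(1, mu[actions])
print(switch_estimator(actions, rewards, pi_b, pi_t, tau=3.0), float(pi_t @ mu))
```

The threshold tau trades off the low bias of importance sampling against the low variance of the plug-in on arms where pi_b places little mass; with tau set to infinity the sketch reduces to pure importance sampling, and with tau = 0 to the pure plug-in.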
Pages: 5314-5339
Page count: 26