Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cited by: 3
Authors
Ma, Cong [1 ]
Zhu, Banghua [2 ]
Jiao, Jiantao [2 ,3 ]
Wainwright, Martin J. [2 ,3 ]
Affiliations
[1] Univ Chicago, Dept Stat, Chicago, IL 60637 USA
[2] Univ Calif Berkeley UC Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley UC Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
Switches; Probability; Monte Carlo methods; Chebyshev approximation; Measurement; Computational modeling; Sociology; Off-policy evaluation; multi-armed bandits; minimax optimality; importance sampling; Polynomials
DOI: 10.1109/TIT.2022.3162335
CLC classification: TP [automation technology, computer technology]
Subject classification code: 0812
Abstract
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger, relative to the oracle estimator equipped with knowledge of the behavior policy, by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial knowledge setting in which it is assumed that the minimum probability taken by the behavior policy is known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. In order to remedy this gap, we propose a new estimator based on approximation by Chebyshev polynomials that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
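The three estimators discussed in the abstract for the known-behavior-policy setting can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the function names, the fixed truncation threshold `tau`, and the decision rule (importance sampling on arms whose likelihood ratio is at most `tau`, plug-in on the rest) are all assumptions made for the sketch.

```python
import numpy as np

def plug_in(actions, rewards, pi_target, n_arms):
    # Estimate each arm's mean reward from the logged data,
    # then average the estimates under the target policy.
    est = np.zeros(n_arms)
    for a in range(n_arms):
        mask = actions == a
        if mask.any():
            est[a] = rewards[mask].mean()
    return float(np.dot(pi_target, est))

def importance_sampling(actions, rewards, pi_target, pi_behavior):
    # Reweight each logged reward by the likelihood ratio
    # pi_target(a) / pi_behavior(a) of its logged action.
    w = pi_target[actions] / pi_behavior[actions]
    return float(np.mean(w * rewards))

def switch_estimator(actions, rewards, pi_target, pi_behavior, tau):
    # Split arms by importance ratio: importance sampling where the
    # ratio is small (low variance), plug-in where it is large.
    ratio = pi_target / pi_behavior
    low = ratio <= tau
    pi_t_low = np.where(low, pi_target, 0.0)   # IS handles these arms
    pi_t_high = np.where(low, 0.0, pi_target)  # plug-in handles these
    is_part = importance_sampling(actions, rewards, pi_t_low, pi_behavior)
    pi_part = plug_in(actions, rewards, pi_t_high, len(pi_target))
    return is_part + pi_part
```

Because the target value decomposes additively over arms, the two partial estimates can simply be summed. In a quick simulation with logged data drawn from a known behavior policy, all three estimators recover the target policy's value.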
Pages: 5314-5339 (26 pages)