Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cited by: 3
Authors
Ma, Cong [1 ]
Zhu, Banghua [2 ]
Jiao, Jiantao [2 ,3 ]
Wainwright, Martin J. [2 ,3 ]
Affiliations
[1] Univ Chicago, Dept Stat, Chicago, IL 60637 USA
[2] Univ Calif Berkeley UC Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley UC Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
Switches; Probability; Monte Carlo methods; Chebyshev approximation; Measurement; Computational modeling; Sociology; Off-policy evaluation; multi-armed bandits; minimax optimality; importance sampling; Polynomials
DOI: 10.1109/TIT.2022.3162335
CLC classification: TP [automation technology, computer technology]
Subject classification code: 0812
Abstract
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger, relative to the oracle estimator equipped with knowledge of the behavior policy, by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial knowledge setting in which it is assumed that the minimum probability taken by the behavior policy is known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. In order to remedy this gap, we propose a new estimator based on approximation by Chebyshev polynomials that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
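The three estimators discussed in the abstract for the known-behavior-policy setting can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the function names, the fixed truncation threshold `tau`, and the decision rule (importance sampling on arms whose likelihood ratio is at most `tau`, plug-in on the rest) are all assumptions made for the sketch.

```python
import numpy as np

def plug_in(actions, rewards, pi_target, n_arms):
    # Estimate each arm's mean reward from the logged data,
    # then average the estimates under the target policy.
    est = np.zeros(n_arms)
    for a in range(n_arms):
        mask = actions == a
        if mask.any():
            est[a] = rewards[mask].mean()
    return float(np.dot(pi_target, est))

def importance_sampling(actions, rewards, pi_target, pi_behavior):
    # Reweight each logged reward by the likelihood ratio
    # pi_target(a) / pi_behavior(a) of its logged action.
    w = pi_target[actions] / pi_behavior[actions]
    return float(np.mean(w * rewards))

def switch_estimator(actions, rewards, pi_target, pi_behavior, tau):
    # Split arms by importance ratio: importance sampling where the
    # ratio is small (low variance), plug-in where it is large.
    ratio = pi_target / pi_behavior
    low = ratio <= tau
    pi_t_low = np.where(low, pi_target, 0.0)   # IS handles these arms
    pi_t_high = np.where(low, 0.0, pi_target)  # plug-in handles these
    is_part = importance_sampling(actions, rewards, pi_t_low, pi_behavior)
    pi_part = plug_in(actions, rewards, pi_t_high, len(pi_target))
    return is_part + pi_part
```

Because the target value decomposes additively over arms, the two partial estimates can simply be summed. In a quick simulation with logged data drawn from a known behavior policy, all three estimators recover the target policy's value.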
Pages: 5314-5339 (26 pages)