Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cited by: 3
Authors
Ma, Cong [1 ]
Zhu, Banghua [2 ]
Jiao, Jiantao [2 ,3 ]
Wainwright, Martin J. [2 ,3 ]
Affiliations
[1] Univ Chicago, Dept Stat, Chicago, IL 60637 USA
[2] Univ Calif Berkeley UC Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley UC Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
Switches; Probability; Monte Carlo methods; Chebyshev approximation; Measurement; Computational modeling; Sociology; Off-policy evaluation; multi-armed bandits; minimax optimality; importance sampling; POLYNOMIALS;
DOI
10.1109/TIT.2022.3162335
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger, relative to the oracle estimator equipped with knowledge of the behavior policy, by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial-knowledge setting, in which the minimum probability assigned by the behavior policy is assumed to be known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. To remedy this gap, we propose a new estimator, based on approximation by Chebyshev polynomials, that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
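As an informal illustration of the estimators named in the abstract, the Python sketch below implements textbook versions of the importance sampling, plug-in, and switch-style estimators for a simulated K-armed bandit. The threshold `tau`, the per-arm switching rule, and all simulation parameters are hypothetical choices for illustration only; the paper's Switch estimator calibrates the split to achieve the minimax rate, and the Chebyshev-polynomial construction for the partial-knowledge setting is not reproduced here.

```python
# Minimal sketch (not the authors' exact procedure) of three off-policy value
# estimators for a K-armed bandit with rewards bounded in [0, 1].
import numpy as np

def importance_sampling(actions, rewards, pi_target, pi_behavior):
    """IS estimator of V(pi_target); requires the behavior policy to be known."""
    weights = pi_target[actions] / pi_behavior[actions]
    return float(np.mean(weights * rewards))

def plug_in(actions, rewards, pi_target):
    """Plug-in estimator: average the empirical arm means under the target policy."""
    value = 0.0
    for a in range(len(pi_target)):
        pulls = rewards[actions == a]
        mu_hat = pulls.mean() if pulls.size > 0 else 0.0  # unobserved arms default to 0
        value += pi_target[a] * mu_hat
    return value

def switch(actions, rewards, pi_target, pi_behavior, tau):
    """Switch-style estimator: importance sampling on arms whose importance
    ratio pi_target/pi_behavior is at most tau, plug-in on the remaining arms."""
    ratio = pi_target / pi_behavior
    small = ratio <= tau
    # IS contribution, restricted to samples from the small-ratio arms.
    is_part = float(np.mean(np.where(small[actions], ratio[actions] * rewards, 0.0)))
    # Plug-in contribution on the large-ratio arms only.
    pi_masked = np.where(small, 0.0, pi_target)
    return is_part + plug_in(actions, rewards, pi_masked)

# Tiny simulation: uniform behavior policy, skewed target policy, Bernoulli rewards.
rng = np.random.default_rng(0)
K, n = 5, 2000
pi_b = np.full(K, 1.0 / K)
pi_t = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
mu = rng.uniform(size=K)                              # true mean rewards
actions = rng.choice(K, size=n, p=pi_b)
rewards = rng.binomial(1, mu[actions]).astype(float)

print("true value :", np.dot(pi_t, mu))
print("IS         :", importance_sampling(actions, rewards, pi_t, pi_b))
print("plug-in    :", plug_in(actions, rewards, pi_t))
print("switch     :", switch(actions, rewards, pi_t, pi_b, tau=2.0))
```

The intuition behind switching: importance sampling is unbiased but high-variance on arms the behavior policy rarely selects (large importance ratio), while the plug-in estimator is more stable there at the cost of bias, so handling each arm with the better-suited estimator trades off the two error sources.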
Pages: 5314-5339
Page count: 26