Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cited by: 3
Authors
Ma, Cong [1 ]
Zhu, Banghua [2 ]
Jiao, Jiantao [2 ,3 ]
Wainwright, Martin J. [2 ,3 ]
Affiliations
[1] Univ Chicago, Dept Stat, Chicago, IL 60637 USA
[2] Univ Calif Berkeley UC Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley UC Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
Switches; Probability; Monte Carlo methods; Chebyshev approximation; Measurement; Computational modeling; Sociology; Off-policy evaluation; multi-armed bandits; minimax optimality; importance sampling; POLYNOMIALS;
DOI
10.1109/TIT.2022.3162335
CLC classification: TP [automation technology; computer technology]
Discipline code: 0812
Abstract
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies: any estimator must have mean-squared error larger, relative to the oracle estimator equipped with knowledge of the behavior policy, by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial-knowledge setting, in which the minimum probability assigned by the behavior policy is assumed known. We show that the plug-in estimator is optimal for relatively large values of this minimum probability, but sub-optimal when it is small. To remedy this gap, we propose a new estimator, based on approximation by Chebyshev polynomials, that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
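The switching idea described in the abstract can be sketched in a few lines: use importance sampling for arms whose importance ratio is small, and the plug-in (empirical-mean) estimate for arms whose ratio is large. This is a minimal illustration, not the paper's exact construction; the threshold `tau`, the function name, and the zero-default for never-pulled arms are assumptions made here for concreteness.

```python
import numpy as np

def switch_estimator(actions, rewards, pi_b, pi_t, tau=2.0):
    """Sketch of a Switch-style off-policy value estimate.

    actions, rewards: logged data gathered under the behavior policy.
    pi_b, pi_t: arrays of per-arm probabilities (behavior / target).
    tau: importance-ratio threshold (a hypothetical tuning choice).
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    ratios = pi_t / pi_b                 # importance ratio pi_t(a) / pi_b(a)
    low = ratios <= tau                  # arms handled by importance sampling

    # Importance-sampling contribution: samples from low-ratio arms only.
    is_part = np.mean(np.where(low[actions], ratios[actions] * rewards, 0.0))

    # Plug-in contribution: empirical mean reward of each high-ratio arm,
    # weighted by its target-policy probability (0 if the arm was never pulled).
    plug_part = 0.0
    for a in np.flatnonzero(~low):
        pulls = rewards[actions == a]
        plug_part += pi_t[a] * (pulls.mean() if pulls.size else 0.0)

    return is_part + plug_part
```

With `tau` large, every arm falls in the importance-sampling branch and the estimator reduces to plain importance sampling; with `tau` small, it reduces to the pure plug-in estimate.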
Pages: 5314-5339 (26 pages)