Multi-Armed Bandits With Self-Information Rewards

Cited: 1
Authors
Weinberger, Nir [1 ]
Yemini, Michal [2 ]
Affiliations
[1] Technion Israel Inst Technol, Fac Elect & Comp Engn, IL-3200003 Haifa, Israel
[2] Bar Ilan Univ, Fac Engn, IL-5290002 Ramat Gan, Israel
Funding
Israel Science Foundation;
Keywords
Multi-armed bandits; self-information rewards; entropy estimation; upper confidence bounds; support size estimation; RANDOM-VARIABLES; ENTROPY; BOUNDS; DIVERSITY;
DOI
10.1109/TIT.2023.3299460
Chinese Library Classification
TP [Automation Technology; Computer Technology];
Discipline Code
0812;
Abstract
This paper introduces the informational multi-armed bandit (IMAB) model, in which at each round, a player chooses an arm, observes a symbol, and receives an unobserved reward in the form of the symbol's self-information. Thus, the expected reward of an arm is the Shannon entropy of the probability mass function of the source that generates its symbols. The player aims to maximize the expected total reward associated with the entropy values of the arms played. Under the assumption that the alphabet size is known, two UCB-based algorithms are proposed for the IMAB model which consider the biases of the plug-in entropy estimator. The first algorithm optimistically corrects the bias term in the entropy estimation. The second algorithm relies on data-dependent confidence intervals that adapt to sources with small entropy values. Performance guarantees are provided by upper bounding the expected regret of each of the algorithms. Furthermore, in the Bernoulli case, the asymptotic behavior of these algorithms is compared to the Lai-Robbins lower bound for the pseudo regret. Additionally, under the assumption that the exact alphabet size is unknown, and instead the player only knows a loose upper bound on it, a UCB-based algorithm is proposed, in which the player aims to reduce the regret caused by the unknown alphabet size in a finite time regime. Numerical results illustrating the expected regret of the algorithms presented in the paper are provided.
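The abstract describes UCB indices built from the plug-in entropy estimator with an optimistic correction for its bias. As an illustration only, here is a minimal sketch of that idea: the `(K-1)/(2n)` term is the classical first-order (Miller-Madow) bias of the plug-in estimator, and the confidence radius is a generic Hoeffding-style term. The function names and the exact constants are assumptions for this sketch; the paper's algorithms use their own, sharper confidence intervals.

```python
import math
import random
from collections import Counter

def plugin_entropy(counts, n):
    """Plug-in (maximum-likelihood) entropy estimate in nats."""
    return -sum((c / n) * math.log(c / n) for c in counts.values() if c > 0)

def ucb_index(counts, n, t, alphabet_size):
    """Optimistic index: plug-in entropy + bias correction + confidence radius.

    (alphabet_size - 1) / (2n) is the first-order bias of the plug-in
    estimator; the radius is a generic Hoeffding-style term. The constants
    here are illustrative, not the paper's.
    """
    h_hat = plugin_entropy(counts, n)
    bias = (alphabet_size - 1) / (2 * n)
    radius = math.log(alphabet_size) * math.sqrt(2 * math.log(t) / n)
    return h_hat + bias + radius

def imab_ucb(sources, alphabet_size, horizon, seed=0):
    """Simulate an informational MAB: pull each arm once, then follow the index."""
    rng = random.Random(seed)
    K = len(sources)
    counts = [Counter() for _ in range(K)]
    pulls = [0] * K
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            arm = t - 1  # initialization: one pull per arm
        else:
            arm = max(range(K),
                      key=lambda a: ucb_index(counts[a], pulls[a], t, alphabet_size))
        symbol = rng.choices(range(alphabet_size), weights=sources[arm])[0]
        counts[arm][symbol] += 1
        pulls[arm] += 1
        # reward is the symbol's self-information under the arm's true source
        total_reward += -math.log(sources[arm][symbol])
    return pulls, total_reward
```

With two Bernoulli arms, e.g. `sources = [[0.5, 0.5], [0.95, 0.05]]` (entropies about 0.69 and 0.20 nats), the index concentrates pulls on the high-entropy arm, as the regret analysis in the paper would predict for its own algorithms.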
Pages: 7160-7184
Page count: 25
Related Papers
50 records in total
  • [1] The value of information in multi-armed bandits with exponentially distributed rewards
    Ryzhov, Ilya O.
    Powell, Warren B.
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS), 2011, 4 : 1363 - 1372
  • [2] Trading off Rewards and Errors in Multi-Armed Bandits
    Erraqabi, Akram
    Lazaric, Alessandro
    Valko, Michal
    Brunskill, Emma
    Liu, Yun-En
    ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 54, 2017, 54 : 709 - 717
  • [3] Parametrized Stochastic Multi-armed Bandits with Binary Rewards
    Jiang, Chong
    Srikant, R.
    2011 AMERICAN CONTROL CONFERENCE, 2011, : 119 - 124
  • [4] Multi-armed Bandits with Generalized Temporally-Partitioned Rewards
    van den Broek, Ronald C.
    Litjens, Rik
    Sagis, Tobias
    Verbeeke, Nina
    Gajane, Pratik
    ADVANCES IN INTELLIGENT DATA ANALYSIS XXII, PT I, IDA 2024, 2024, 14641 : 41 - 52
  • [5] Combinatorial Multi-Armed Bandits with Concave Rewards and Fairness Constraints
    Xu, Huanle
    Liu, Yang
    Lau, Wing Cheong
    Li, Rui
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 2554 - 2560
  • [6] Multi-player Multi-armed Bandits: Decentralized Learning with IID Rewards
    Kalathil, Dileep
    Nayyar, Naumaan
    Jain, Rahul
    2012 50TH ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING (ALLERTON), 2012, : 853 - 860
  • [7] Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards
    Lee, Kyungjae
    Yang, Hongjun
    Lim, Sungbin
    Oh, Songhwai
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [8] Maximizing and Satisficing in Multi-armed Bandits with Graph Information
    Thaker, Parth K.
    Malu, Mohit
    Rao, Nikhil
    Dasarathy, Gautam
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [9] On Kernelized Multi-armed Bandits
    Chowdhury, Sayak Ray
    Gopalan, Aditya
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017, 70
  • [10] Multi-armed Bandits with Compensation
    Wang, Siwei
    Huang, Longbo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31