Nonstochastic Bandits with Composite Anonymous Feedback

Cited by: 0
Authors
Cesa-Bianchi, Nicolo [1 ]
Cesari, Tommaso [2 ,3 ]
Colomboni, Roberto [4 ,5 ]
Gentile, Claudio [6 ]
Mansour, Yishay [7 ,8 ]
Affiliations
[1] Univ Milan, Dept Comp Sci & DSRC, Milan, Italy
[2] Paul Sabatier Univ UT3, Inst Math Toulouse IMT, Toulouse, France
[3] Toulouse Sch Econ TSE, Toulouse, France
[4] Ist Italiano Tecnol IIT, Genoa, Italy
[5] Univ Milan, Milan, Italy
[6] Google Res, New York, NY USA
[7] Tel Aviv Univ, Tel Aviv, Israel
[8] Google Res, Tel Aviv, Israel
Funding
European Research Council; Israel Science Foundation;
Keywords
Multi-armed bandits; non-stochastic losses; composite losses; delayed feedback; online learning; MULTIARMED BANDIT;
DOI
None available
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Number
0812;
Abstract
We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over the subsequent rounds in an adversarial way. The instantaneous loss observed by the player at the end of each round is then a sum of many loss components of previously played actions. This setting encompasses as a special case the easier task of bandits with delayed feedback, a well-studied framework where the player observes the delayed losses individually. Our first contribution is a general reduction transforming a standard bandit algorithm into one that can operate in the harder setting: we bound the regret of the transformed algorithm in terms of the stability and regret of the original algorithm. Then, we show that the transformation of a suitably tuned FTRL with Tsallis entropy has a regret of order $\sqrt{(d+1)KT}$, where d is the maximum delay, K is the number of arms, and T is the time horizon. Finally, we show that our results cannot be improved in general by exhibiting a matching (up to a log factor) lower bound on the regret of any algorithm operating in this setting.
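To make the feedback protocol concrete, the following minimal Python sketch (not from the paper) simulates composite anonymous feedback: the adversary splits each loss over the next d + 1 rounds, and at the end of each round the player only observes the anonymous sum of components from previously played arms. The Dirichlet split and the naive Exp3-style learner are assumptions for illustration; the paper instead wraps a base algorithm, such as FTRL with Tsallis entropy, through its stability-based reduction.

    import numpy as np

    rng = np.random.default_rng(0)
    K, T, d = 5, 5000, 3  # arms, horizon, maximum delay

    # Adversary: for every round t and arm k, a total loss in [0, 1] split
    # arbitrarily over rounds t, t+1, ..., t+d (here a random Dirichlet split).
    totals = rng.random((T, K))
    splits = rng.dirichlet(np.ones(d + 1), size=(T, K))  # each row sums to 1
    components = totals[..., None] * splits              # shape (T, K, d+1)

    eta = np.sqrt(np.log(K) / ((d + 1) * K * T))  # conservative learning rate
    cum_est = np.zeros(K)          # cumulative importance-weighted loss estimates
    pending = np.zeros(T + d + 1)  # loss mass already committed to future rounds
    total_loss = 0.0

    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))  # shift for numerical stability
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total_loss += totals[t, a]

        # The chosen arm's loss is spread over rounds t, ..., t + d ...
        pending[t : t + d + 1] += components[t, a]
        # ... and the player only sees the anonymous sum at the end of round t:
        # components of arms played at rounds t - d, ..., t, with no way to
        # attribute them to individual past actions.
        obs = pending[t]

        # Naive update charging the whole observation to the current arm; this
        # is biased, which is precisely what the paper's reduction corrects.
        cum_est[a] += obs / p[a]

    regret = total_loss - totals.sum(axis=0).min()
    print(f"Naive learner regret under composite anonymous feedback: {regret:.1f}")

Note the design point the simulation exposes: since the observation at round t mixes components of up to d + 1 past actions, any importance-weighted estimate built from it alone is biased, which is why the reduction must control the stability of the base algorithm.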
Pages: 750-773
Page count: 24