First- and Second-Order Bounds for Adversarial Linear Contextual Bandits

被引：0

作者：

Olkhovskaya, Julia ^{[1
,5
]}

Mayo, Jack ^{[2
]}

van Erven, Tim ^{[2
]}

Neu, Gergely ^{[3
]}

Wei, Chen-Yu ^{[4
]}

机构：

[1] Delft Univ Technol, Dept Intelligent Syst, Delft, Netherlands

[2] Univ Amsterdam, Korteweg de Vries Inst Math, Amsterdam, Netherlands

[3] Univ Pompeu Fabra, AI Grp, DTIC, Barcelona, Spain

[4] MIT, MIT Inst Data Syst & Soc, 77 Massachusetts Ave, Cambridge, MA 02139 USA

[5] Vrije Univ Amsterdam, Amsterdam, Netherlands

来源：

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023年

基金：

欧洲研究理事会;

关键词：

D O I：

暂无

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of K arms to change over time without restriction. Assuming the d-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of T rounds is known to scale as (O) over tilde(root KdT). Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order (O) over tilde(root dV(T)) in terms of the cumulative second moment of the learner's losses V-T, and a closely related first-order bound of order (O) over tilde (K root dL(T)*) in terms of the cumulative loss of the best policy L-T*. Since V-T or L-T* may be significantly smaller than T, these improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.

引用

页数：20

共 30 条

[1]

Abbasi-Yadkori Yasin., 2011, Advances in Neural Information Processing Systems, P2312

[2]

Agarwal A, 2014, PR MACH LEARN RES, V32, P1638

[3]

Agarwal Alekh., 2016, Making contextual decisions with low technical debt

[4]

Agarwal Alekh, 2017, P C LEARNING THEORY, P4

[5]

Allen-Zhu Zeyuan, 2018, P MACHINE LEARNING R, V80

[6]

Allenberg C, 2006, LECT NOTES ARTIF INT, V4264, P229

[7]

[Anonymous], 2010, P 19 INT C WORLD WID

[8]

Auer P, 2003, SIAM J COMPUT, V32, P48, DOI 10.1137/S0097539701398375

[9] Finite-time analysis of the multiarmed bandit problem [J].

Auer, P ;

Cesa-Bianchi, N ;

Fischer, P .

MACHINE LEARNING, 2002, 47 (2-3) :235-256

[10]

Auer P., 2002, J MACH LEARN RES, V3, P397, DOI [10.5555/944919.944941, DOI 10.4271/610369]

← 1 2 3 →