First- and Second-Order Bounds for Adversarial Linear Contextual Bandits

被引:0
作者
Olkhovskaya, Julia [1 ,5 ]
Mayo, Jack [2 ]
van Erven, Tim [2 ]
Neu, Gergely [3 ]
Wei, Chen-Yu [4 ]
机构
[1] Delft Univ Technol, Dept Intelligent Syst, Delft, Netherlands
[2] Univ Amsterdam, Korteweg de Vries Inst Math, Amsterdam, Netherlands
[3] Univ Pompeu Fabra, AI Grp, DTIC, Barcelona, Spain
[4] MIT, MIT Inst Data Syst & Soc, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[5] Vrije Univ Amsterdam, Amsterdam, Netherlands
来源
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023年
基金
欧洲研究理事会;
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of K arms to change over time without restriction. Assuming the d-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of T rounds is known to scale as (O) over tilde(root KdT). Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order (O) over tilde(root dV(T)) in terms of the cumulative second moment of the learner's losses V-T, and a closely related first-order bound of order (O) over tilde (K root dL(T)*) in terms of the cumulative loss of the best policy L-T*. Since V-T or L-T* may be significantly smaller than T, these improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.
引用
收藏
页数:20
相关论文
共 30 条
[1]  
Abbasi-Yadkori Yasin., 2011, Advances in Neural Information Processing Systems, P2312
[2]  
Agarwal A, 2014, PR MACH LEARN RES, V32, P1638
[3]  
Agarwal Alekh., 2016, Making contextual decisions with low technical debt
[4]  
Agarwal Alekh, 2017, P C LEARNING THEORY, P4
[5]  
Allen-Zhu Zeyuan, 2018, P MACHINE LEARNING R, V80
[6]  
Allenberg C, 2006, LECT NOTES ARTIF INT, V4264, P229
[7]  
[Anonymous], 2010, P 19 INT C WORLD WID
[8]  
Auer P, 2003, SIAM J COMPUT, V32, P48, DOI 10.1137/S0097539701398375
[9]   Finite-time analysis of the multiarmed bandit problem [J].
Auer, P ;
Cesa-Bianchi, N ;
Fischer, P .
MACHINE LEARNING, 2002, 47 (2-3) :235-256
[10]  
Auer P., 2002, J MACH LEARN RES, V3, P397, DOI [10.5555/944919.944941, DOI 10.4271/610369]