Multi-armed bandits: Theory and applications to online learning in networks

Cited: 0
Authors
Zhao Q. [1]
Affiliations
[1] Cornell University, United States
Source
Synthesis Lectures on Communication Networks | 2020, Vol. 12, No. 1
Funding
U.S. National Science Foundation; EU Horizon 2020
Keywords
Machine learning; Markov decision processes; Multi-armed bandit; Online learning; Reinforcement learning;
DOI
10.2200/S00941ED2V01Y201907CNT022
Abstract
Multi-armed bandit problems pertain to optimal sequential decision making and learning in unknown environments. Since the first bandit problem, posed by Thompson in 1933 in the context of clinical trials, bandit problems have enjoyed lasting attention from multiple research communities and have found a wide range of applications across diverse domains. This book covers classic results and recent developments on both Bayesian and frequentist bandit problems. We start in Chapter 1 with a brief overview of the history of bandit problems, contrasting the two schools of approaches, Bayesian and frequentist, and highlighting foundational results and key applications. Chapters 2 and 4 cover, respectively, the canonical Bayesian and frequentist bandit models. In Chapters 3 and 5, we discuss major variants of the canonical bandit models that lead to new directions, bring in new techniques, and broaden the applications of this classical problem. In Chapter 6, we present several representative application examples in communication networks and socioeconomic systems, aiming to illuminate the connections between the Bayesian and frequentist formulations of bandit problems and how structural results pertaining to one may be leveraged to obtain solutions under the other. © 2020 by Morgan & Claypool. All rights reserved.
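As an illustrative aside (not taken from the book itself), the following minimal Python sketch shows the UCB1 index policy of Auer, Cesa-Bianchi, and Fischer (2002), a canonical algorithm for the frequentist bandit model of the kind covered in Chapter 4. The Bernoulli reward setup, the `pull` interface, and the horizon are assumptions made purely for this example.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Run the UCB1 index policy for `horizon` rounds.

    `pull(arm)` is assumed to return a reward in [0, 1]; `counts` and
    `means` track per-arm statistics used to form the confidence indices.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    # Initialization: play each arm once.
    for arm in range(n_arms):
        counts[arm] = 1
        means[arm] = pull(arm)
    for t in range(n_arms + 1, horizon + 1):
        # Index = sample mean + exploration bonus; the bonus shrinks as an
        # arm is sampled more, trading off exploration and exploitation.
        ucb = [means[a] + math.sqrt(2 * math.log(t) / counts[a])
               for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: ucb[a])
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts

# Hypothetical usage: three Bernoulli arms with unknown success probabilities.
probs = [0.3, 0.5, 0.7]
means, counts = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0,
                     n_arms=3, horizon=10_000)
print(counts)  # the best arm (index 2) should dominate the play counts
```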
Pages: 1-165
Number of pages: 164