Approximate information for efficient exploration-exploitation strategies

Cited: 0
Authors
Barbier-Chebbah, Alex [1,2]
Vestergaard, Christian L. [1,2]
Masson, Jean-Baptiste [1,2]
Affiliations
[1] Université Paris Cité, Institut Pasteur, CNRS UMR 3571, Decision and Bayesian Computation, F-75015 Paris, France
[2] Épiméthée, Inria, F-75012 Paris, France
Keywords
Clinical trials; Bandit models; Space; Go
DOI
10.1103/PhysRevE.109.L052105
CLC Classification
O35 [Fluid Mechanics]; O53 [Plasma Physics]
Discipline Codes
070204; 080103; 080704
Abstract
This paper addresses the exploration-exploitation dilemma inherent in decision-making, focusing on multiarmed bandit problems. These involve an agent deciding whether to exploit current knowledge for immediate gains or to explore new avenues for potential long-term rewards. Here we introduce a class of algorithms, approximate information maximization (AIM), which employs a carefully chosen analytical approximation to the gradient of the entropy to choose which arm to pull at each point in time. AIM matches the performance of Thompson sampling, which is known to be asymptotically optimal, as well as that of Infomax, from which it derives. AIM thus retains the advantages of Infomax while offering greater computational speed, tractability, and ease of implementation. In particular, we demonstrate how to apply it to a 50-armed bandit game. Its expression is tunable, allowing it to be optimized for specific settings and making it possible to surpass the performance of Thompson sampling at short and intermediate times.
Pages: 6
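
The abstract describes AIM's selection rule only at a high level (a deterministic criterion built from an analytical approximation to the gradient of the posterior entropy), so its exact form is not reproduced here. For reference, below is a minimal sketch of the Thompson-sampling baseline the paper benchmarks against, run on a 50-armed Bernoulli bandit with Beta posteriors; all variable names and parameter values are illustrative, not taken from the paper.

# Minimal sketch of the Thompson-sampling baseline for a 50-armed
# Bernoulli bandit. AIM (not reproduced here) would replace the
# posterior-sampling step with a deterministic choice based on an
# approximate entropy gradient; the exact rule is given in the paper.
import numpy as np

rng = np.random.default_rng(0)

n_arms = 50                               # the paper demonstrates a 50-armed game
true_means = rng.uniform(0, 1, n_arms)    # hidden Bernoulli success rates
alpha = np.ones(n_arms)                   # Beta posterior: successes + 1
beta = np.ones(n_arms)                    # Beta posterior: failures + 1

horizon = 10_000
regret = 0.0
best_mean = true_means.max()

for t in range(horizon):
    # Thompson sampling: draw one sample per arm from its Beta
    # posterior and pull the arm whose sample is largest.
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))

    reward = rng.random() < true_means[arm]   # Bernoulli reward
    alpha[arm] += reward
    beta[arm] += 1 - reward
    regret += best_mean - true_means[arm]

print(f"cumulative regret after {horizon} pulls: {regret:.1f}")

In this baseline, the per-step cost is dominated by drawing one posterior sample per arm. Per the abstract, AIM instead evaluates a tunable closed-form criterion, which is what gives it Infomax-like behavior at lower computational cost.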