Humans Use Directed and Random Exploration to Solve the Explore-Exploit Dilemma

Cited by: 306
Authors
Wilson, Robert C. [1]
Geana, Andra [1, 2]
White, John M. [2]
Ludvig, Elliot A. [2, 3]
Cohen, Jonathan D. [1, 2]
Affiliations
[1] Princeton Univ, Princeton Neurosci Inst, Princeton, NJ 08544 USA
[2] Princeton Univ, Dept Psychol, Princeton, NJ 08544 USA
[3] Univ Warwick, Dept Psychol, Coventry CV4 7AL, W Midlands, England
Keywords
explore-exploit; decision making; information bonus; decision noise; reinforcement learning; DECISIONS; AMBIGUITY; VARIABILITY
DOI
10.1037/a0038199
Chinese Library Classification (CLC) number
B84 [Psychology]
Discipline classification code
04; 0402
Abstract
All adaptive organisms face the fundamental tradeoff between pursuing a known reward (exploitation) and sampling lesser-known options in search of something better (exploration). Theory suggests at least two strategies for solving this dilemma: a directed strategy in which choices are explicitly biased toward information seeking, and a random strategy in which decision noise leads to exploration by chance. In this work we investigated the extent to which humans use these two strategies. In our "Horizon task," participants made explore-exploit decisions in two contexts that differed in the number of choices that they would make in the future (the time horizon). Participants were allowed to make either a single choice in each game (horizon 1), or 6 sequential choices (horizon 6), giving them more opportunity to explore. By modeling the behavior in these two conditions, we were able to measure exploration-related changes in decision making and quantify the contributions of the two strategies to behavior. We found that participants were more information seeking and had higher decision noise with the longer horizon, suggesting that humans use both strategies to solve the exploration-exploitation dilemma. We thus conclude that both information seeking and choice variability can be controlled and put to use in the service of exploration.
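The two strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' fitted model: it assumes a simple logistic choice rule, and the function name `p_choose_a`, the parameters `info_bonus` and `noise`, and all numeric values are hypothetical. Directed exploration is represented as a bonus added to the value of the more informative option, and random exploration as a larger noise term that flattens the choice probabilities.

```python
import math


def p_choose_a(mean_a, mean_b, info_a, info_b, info_bonus, noise):
    """Probability of choosing option A under an assumed logistic choice rule.

    mean_a, mean_b : observed mean rewards of the two options
    info_a, info_b : how informative each choice is (1 = less sampled option)
    info_bonus     : weight on information (directed exploration)
    noise          : decision noise (random exploration), must be > 0
    """
    # Reward difference plus a bonus for the more informative option.
    delta = (mean_a - mean_b) + info_bonus * (info_a - info_b)
    # Larger noise pushes the probability toward 0.5.
    return 1.0 / (1.0 + math.exp(-delta / noise))


# Hypothetical scenario: A has the higher observed mean, B has been sampled
# less, so choosing B is the more informative (exploratory) choice.
exploit = p_choose_a(60, 50, 0, 1, info_bonus=0.0, noise=5.0)
directed = p_choose_a(60, 50, 0, 1, info_bonus=8.0, noise=5.0)
random_ = p_choose_a(60, 50, 0, 1, info_bonus=0.0, noise=20.0)

print(f"baseline P(A): {exploit:.3f}")
print(f"with information bonus P(A): {directed:.3f}")
print(f"with higher decision noise P(A): {random_:.3f}")
```

In this sketch both manipulations lower the probability of choosing the better-known option A, but for different reasons: the information bonus shifts the effective value toward B, while higher noise makes the choice less sensitive to any value difference.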
Pages: 2074-2081
Page count: 8