Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Cited by: 29
Authors
Bennett, Daniel [1 ,2 ]
Niv, Yael [1 ,3 ]
Langdon, Angela J. [1 ]
Affiliations
[1] Princeton Univ, Princeton Neurosci Inst, Princeton, NJ 08544 USA
[2] Monash Univ, Dept Psychiat, Clayton, Vic, Australia
[3] Princeton Univ, Dept Psychol, Princeton, NJ 08544 USA
Funding
UK Medical Research Council;
Keywords
DECISION VARIABLES; HUMAN STRIATUM; BASAL GANGLIA; DOPAMINE; PREDICTION; SIGNALS; CHOICE; STATE; CIRCUITRY; ENCODE;
DOI
10.1016/j.cobeha.2021.04.020
Chinese Library Classification (CLC)
B84 [Psychology]; C [Social Sciences, General]; Q98 [Anthropology];
Discipline codes
03 ; 0303 ; 030303 ; 04 ; 0402 ;
Abstract
Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
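To make the contrast in the abstract concrete, here is a minimal sketch (illustrative only, not taken from the paper) of a REINFORCE-style policy-gradient learner for a two-armed bandit in Python. The agent maintains only action preferences (policy parameters) and updates them directly along the gradient of the log-policy weighted by reward, with no intermediate action-value estimates; the reward probabilities, learning rate, and trial count are assumed for illustration.

import numpy as np

rng = np.random.default_rng(0)
reward_prob = np.array([0.8, 0.2])   # assumed payoff probability of each action
theta = np.zeros(2)                  # action preferences (policy parameters)
alpha = 0.1                          # learning rate (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(1000):
    pi = softmax(theta)                       # current behavioral policy
    a = rng.choice(2, p=pi)                   # sample an action from the policy
    r = float(rng.random() < reward_prob[a])  # binary reward outcome
    # REINFORCE update: the gradient of log pi(a) under a softmax policy
    # is one_hot(a) - pi; scale it by reward. There is no value-prediction
    # step anywhere in the update.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi

print("learned choice probabilities:", softmax(theta))

Over trials, the preferences converge so that the richer action is chosen with high probability, even though the agent never represents the expected value of either action.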
Pages: 114-121
Page count: 8