Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Cited by: 25
Authors
Bennett, Daniel [1 ,2 ]
Niv, Yael [1 ,3 ]
Langdon, Angela J. [1 ]
Affiliations
[1] Princeton Univ, Princeton Neurosci Inst, Princeton, NJ 08544 USA
[2] Monash Univ, Dept Psychiat, Clayton, Vic, Australia
[3] Princeton Univ, Dept Psychol, Princeton, NJ 08544 USA
Funding
UK Medical Research Council
Keywords
DECISION VARIABLES; HUMAN STRIATUM; BASAL GANGLIA; DOPAMINE; PREDICTION; SIGNALS; CHOICE; STATE; CIRCUITRY; ENCODE;
DOI
10.1016/j.cobeha.2021.04.020
Chinese Library Classification (CLC)
B84 [Psychology]; C [Social sciences, general]; Q98 [Anthropology]
Discipline classification codes
03; 0303; 030303; 04; 0402
Abstract
Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
Pages: 114-121
Page count: 8
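
To make the contrast drawn in the abstract concrete, the sketch below compares a value-based learner (delta-rule update of action values with softmax choice) with a policy-gradient learner (REINFORCE-style update of action preferences) on a two-armed bandit. This is a minimal illustration only, not the authors' model: the bandit task, parameter values (alpha, beta), and the running-average reward baseline are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
p_reward = np.array([0.8, 0.2])   # hypothetical reward probabilities per arm
alpha = 0.1                        # learning rate (both learners)
beta = 3.0                         # softmax inverse temperature (value-based learner)
n_trials = 1000

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Value-based learner: learns action values, chooses via softmax over values.
Q = np.zeros(2)
for _ in range(n_trials):
    probs = softmax(beta * Q)
    a = rng.choice(2, p=probs)
    r = float(rng.random() < p_reward[a])
    Q[a] += alpha * (r - Q[a])          # delta-rule value update

# Policy-gradient learner: updates policy parameters directly (REINFORCE);
# no action values are ever represented.
theta = np.zeros(2)                     # action preferences (policy parameters)
baseline = 0.0                          # running-average reward baseline
for _ in range(n_trials):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = float(rng.random() < p_reward[a])
    grad = -probs                       # gradient of log pi(a) w.r.t. theta
    grad[a] += 1.0                      # for a softmax policy: onehot(a) - probs
    theta += alpha * (r - baseline) * grad
    baseline += alpha * (r - baseline)

print("value-based Q:", Q)
print("policy-gradient choice probabilities:", softmax(theta))

Both learners converge on preferring the better arm, but only the first does so by estimating values; the second adjusts choice probabilities directly from reinforced actions, which is the distinction the abstract emphasizes.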