Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Cited by: 25
Authors
Bennett, Daniel [1,2]
Niv, Yael [1,3]
Langdon, Angela J. [1]
Affiliations
[1] Princeton Univ, Princeton Neurosci Inst, Princeton, NJ 08544 USA
[2] Monash Univ, Dept Psychiat, Clayton, Vic, Australia
[3] Princeton Univ, Dept Psychol, Princeton, NJ 08544 USA
Funding
UK Medical Research Council
Keywords
DECISION VARIABLES; HUMAN STRIATUM; BASAL GANGLIA; DOPAMINE; PREDICTION; SIGNALS; CHOICE; STATE; CIRCUITRY; ENCODE
DOI
10.1016/j.cobeha.2021.04.020
Chinese Library Classification (CLC)
B84 [Psychology]; C [Social Sciences, General]; Q98 [Anthropology]
Subject classification codes
03; 0303; 030303; 04; 0402
Abstract
Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
Pages: 114-121
Page count: 8
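
To make the contrast drawn in the abstract concrete, the sketch below pits the two model families against each other on a two-armed bandit. It is an illustration, not the authors' implementation: the reward probabilities, learning rates, softmax temperature, and function names are assumptions chosen for this example. The value-based learner updates action values with a delta rule and chooses by softmax over those values; the policy-gradient learner applies a REINFORCE-style update to its policy parameters directly, without storing values.

```python
# Minimal illustrative sketch (not the authors' model): value-based vs.
# policy-gradient learning on a two-armed bandit. Reward probabilities,
# learning rates, and the softmax temperature are assumptions for the example.
import numpy as np

rng = np.random.default_rng(0)
REWARD_PROBS = [0.8, 0.2]   # assumed reward probability of each arm

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def value_based_learner(n_trials=1000, alpha=0.1, beta=3.0):
    """Delta-rule learning of action values Q, with softmax choice over Q."""
    q = np.zeros(2)
    choices = []
    for _ in range(n_trials):
        p = softmax(beta * q)                  # choose by comparing learned values
        a = rng.choice(2, p=p)
        r = float(rng.random() < REWARD_PROBS[a])
        q[a] += alpha * (r - q[a])             # prediction-error update of the value
        choices.append(a)
    return float(np.mean(choices))

def policy_gradient_learner(n_trials=1000, alpha=0.1):
    """REINFORCE-style update of policy parameters; no action values are stored."""
    theta = np.zeros(2)
    choices = []
    for _ in range(n_trials):
        p = softmax(theta)
        a = rng.choice(2, p=p)
        r = float(rng.random() < REWARD_PROBS[a])
        grad = -p
        grad[a] += 1.0                         # d log pi(a) / d theta for a softmax policy
        theta += alpha * r * grad              # adjust the policy directly toward rewarded actions
        choices.append(a)
    return float(np.mean(choices))

print("P(choose poorer arm), value-based:    ", value_based_learner())
print("P(choose poorer arm), policy-gradient:", policy_gradient_learner())
```

Both learners come to favor the richer arm; the difference the abstract highlights lies in the intermediate quantities. The value-based agent maintains explicit learned action values and chooses by comparing them, whereas the policy-gradient agent adjusts its choice probabilities directly, with no value-learning step.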