Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Times cited: 29
Authors
Bennett, Daniel [1 ,2 ]
Niv, Yael [1 ,3 ]
Langdon, Angela J. [1 ]
Affiliations
[1] Princeton Univ, Princeton Neurosci Inst, Princeton, NJ 08544 USA
[2] Monash Univ, Dept Psychiat, Clayton, Vic, Australia
[3] Princeton Univ, Dept Psychol, Princeton, NJ 08544 USA
Funding
UK Medical Research Council
Keywords
DECISION VARIABLES; HUMAN STRIATUM; BASAL GANGLIA; DOPAMINE; PREDICTION; SIGNALS; CHOICE; STATE; CIRCUITRY; ENCODE;
DOI
10.1016/j.cobeha.2021.04.020
Chinese Library Classification (CLC)
B84 [Psychology]; C [Social Sciences, General]; Q98 [Anthropology];
Discipline classification codes
03; 0303; 030303; 04; 0402;
Abstract
Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
Pages: 114-121
Page count: 8
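
Illustrative note (not part of the record, and not drawn from the paper itself): the abstract above contrasts value-based learning, in which a decision-maker learns action values and chooses by comparing them, with policy-gradient learning, in which reward feedback adjusts the behavioral policy directly with no intermediate value estimates. The minimal Python sketch below shows the two update rules side by side for a hypothetical two-armed bandit; the task, parameter names, and values (reward_prob, alpha_q, alpha_pi, beta) are assumptions chosen only for the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit; reward probabilities are illustrative only.
reward_prob = np.array([0.8, 0.2])

# Value-based learner (Rescorla-Wagner / Q-learning style):
# stores an action value Q(a) and chooses via a softmax over values.
Q = np.zeros(2)
alpha_q = 0.1    # value learning rate (assumed)
beta = 5.0       # softmax inverse temperature (assumed)

# Policy-gradient learner (REINFORCE style):
# stores action preferences theta directly; no values are learned.
theta = np.zeros(2)
alpha_pi = 0.1   # policy learning rate (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(1000):
    # Value-based update: the prediction error moves Q toward the reward.
    p_q = softmax(beta * Q)
    a = rng.choice(2, p=p_q)
    r = float(rng.random() < reward_prob[a])
    Q[a] += alpha_q * (r - Q[a])

    # Policy-gradient update: reward directly reinforces the gradient of the
    # log-probability of the chosen action, with no value estimate in between.
    p_pi = softmax(theta)
    a = rng.choice(2, p=p_pi)
    r = float(rng.random() < reward_prob[a])
    grad_log_pi = np.eye(2)[a] - p_pi    # d/dtheta log pi(a | theta)
    theta += alpha_pi * r * grad_log_pi

print("value-based policy:   ", softmax(beta * Q).round(2))
print("policy-gradient policy:", softmax(theta).round(2))

Both learners end up preferring the richer arm, but only the value-based learner maintains explicit reward estimates; the policy-gradient learner represents nothing beyond the action preferences themselves, which is the sense in which the abstract calls it "value-free".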