Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Times cited: 29
Authors
Bennett, Daniel [1 ,2 ]
Niv, Yael [1 ,3 ]
Langdon, Angela J. [1 ]
Affiliations
[1] Princeton Univ, Princeton Neurosci Inst, Princeton, NJ 08544 USA
[2] Monash Univ, Dept Psychiat, Clayton, Vic, Australia
[3] Princeton Univ, Dept Psychol, Princeton, NJ 08544 USA
Funding
UK Medical Research Council
Keywords
DECISION VARIABLES; HUMAN STRIATUM; BASAL GANGLIA; DOPAMINE; PREDICTION; SIGNALS; CHOICE; STATE; CIRCUITRY; ENCODE;
DOI
10.1016/j.cobeha.2021.04.020
Chinese Library Classification (CLC)
B84 [Psychology]; C [Social Sciences, General]; Q98 [Anthropology];
Discipline classification codes
03; 0303; 030303; 04; 0402;
Abstract
Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
Pages: 114-121
Page count: 8
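
Illustrative note (not part of the record, and not drawn from the paper itself): the abstract above contrasts value-based learning, in which a decision-maker learns action values and chooses by comparing them, with policy-gradient learning, in which reward feedback adjusts the behavioral policy directly with no intermediate value estimates. The minimal Python sketch below shows the two update rules side by side for a hypothetical two-armed bandit; the task, parameter names, and values (reward_prob, alpha_q, alpha_pi, beta) are assumptions chosen only for the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit; reward probabilities are illustrative only.
reward_prob = np.array([0.8, 0.2])

# Value-based learner (Rescorla-Wagner / Q-learning style):
# stores an action value Q(a) and chooses via a softmax over values.
Q = np.zeros(2)
alpha_q = 0.1    # value learning rate (assumed)
beta = 5.0       # softmax inverse temperature (assumed)

# Policy-gradient learner (REINFORCE style):
# stores action preferences theta directly; no values are learned.
theta = np.zeros(2)
alpha_pi = 0.1   # policy learning rate (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(1000):
    # Value-based update: the prediction error moves Q toward the reward.
    p_q = softmax(beta * Q)
    a = rng.choice(2, p=p_q)
    r = float(rng.random() < reward_prob[a])
    Q[a] += alpha_q * (r - Q[a])

    # Policy-gradient update: reward directly reinforces the gradient of the
    # log-probability of the chosen action, with no value estimate in between.
    p_pi = softmax(theta)
    a = rng.choice(2, p=p_pi)
    r = float(rng.random() < reward_prob[a])
    grad_log_pi = np.eye(2)[a] - p_pi    # d/dtheta log pi(a | theta)
    theta += alpha_pi * r * grad_log_pi

print("value-based policy:   ", softmax(beta * Q).round(2))
print("policy-gradient policy:", softmax(theta).round(2))

Both learners end up preferring the richer arm, but only the value-based learner maintains explicit reward estimates; the policy-gradient learner represents nothing beyond the action preferences themselves, which is the sense in which the abstract calls it "value-free".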