Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Cited by: 29
Authors
Bennett, Daniel [1 ,2 ]
Niv, Yael [1 ,3 ]
Langdon, Angela J. [1 ]
Affiliations
[1] Princeton Univ, Princeton Neurosci Inst, Princeton, NJ 08544 USA
[2] Monash Univ, Dept Psychiat, Clayton, Vic, Australia
[3] Princeton Univ, Dept Psychol, Princeton, NJ 08544 USA
Funding
UK Medical Research Council;
Keywords
DECISION VARIABLES; HUMAN STRIATUM; BASAL GANGLIA; DOPAMINE; PREDICTION; SIGNALS; CHOICE; STATE; CIRCUITRY; ENCODE;
DOI
10.1016/j.cobeha.2021.04.020
Chinese Library Classification (CLC)
B84 [Psychology]; C [Social Sciences, General]; Q98 [Anthropology];
Discipline codes
03 ; 0303 ; 030303 ; 04 ; 0402 ;
Abstract
Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
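To make the contrast in the abstract concrete, here is a minimal sketch (illustrative only, not taken from the paper) of a REINFORCE-style policy-gradient learner for a two-armed bandit in Python. The agent maintains only action preferences (policy parameters) and updates them directly along the gradient of the log-policy weighted by reward, with no intermediate action-value estimates; the reward probabilities, learning rate, and trial count are assumed for illustration.

import numpy as np

rng = np.random.default_rng(0)
reward_prob = np.array([0.8, 0.2])   # assumed payoff probability of each action
theta = np.zeros(2)                  # action preferences (policy parameters)
alpha = 0.1                          # learning rate (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(1000):
    pi = softmax(theta)                       # current behavioral policy
    a = rng.choice(2, p=pi)                   # sample an action from the policy
    r = float(rng.random() < reward_prob[a])  # binary reward outcome
    # REINFORCE update: the gradient of log pi(a) under a softmax policy
    # is one_hot(a) - pi; scale it by reward. There is no value-prediction
    # step anywhere in the update.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi

print("learned choice probabilities:", softmax(theta))

Over trials, the preferences converge so that the richer action is chosen with high probability, even though the agent never represents the expected value of either action.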
Pages: 114-121
Page count: 8