Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Cited by: 25
Authors
Bennett, Daniel [1 ,2 ]
Niv, Yael [1 ,3 ]
Langdon, Angela J. [1 ]
Affiliations
[1] Princeton Univ, Princeton Neurosci Inst, Princeton, NJ 08544 USA
[2] Monash Univ, Dept Psychiat, Clayton, Vic, Australia
[3] Princeton Univ, Dept Psychol, Princeton, NJ 08544 USA
Funding
UK Medical Research Council;
Keywords
DECISION VARIABLES; HUMAN STRIATUM; BASAL GANGLIA; DOPAMINE; PREDICTION; SIGNALS; CHOICE; STATE; CIRCUITRY; ENCODE;
DOI
10.1016/j.cobeha.2021.04.020
Chinese Library Classification (CLC)
B84 [Psychology]; C [Social Sciences, General]; Q98 [Anthropology];
Discipline classification codes
03 ; 0303 ; 030303 ; 04 ; 0402 ;
Abstract
Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value-learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
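To illustrate the distinction the abstract draws, the sketch below contrasts a value-based learner (which stores action values and derives its choices from them) with a policy-gradient learner (which adjusts action preferences directly from reward, REINFORCE-style) on a two-armed bandit. This is a minimal illustrative sketch, not code or parameters from the paper; the reward probabilities, learning rates, and baseline are assumptions chosen for the example.

```python
# Illustrative sketch (not from the paper): value-based vs. policy-gradient
# learning on a two-armed bandit with assumed reward probabilities [0.8, 0.2].
import numpy as np

rng = np.random.default_rng(0)
p_reward = np.array([0.8, 0.2])   # true reward probability of each action (assumed)
n_trials = 1000
alpha = 0.1                       # learning rate for both learners (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# --- Value-based learner: maintains action values, chooses by comparing them ---
Q = np.zeros(2)
for _ in range(n_trials):
    a = rng.choice(2, p=softmax(Q))           # policy derived from learned values
    r = float(rng.random() < p_reward[a])     # binary reward
    Q[a] += alpha * (r - Q[a])                # prediction-error update on the value

# --- Policy-gradient learner: adjusts policy parameters directly, no values stored ---
theta = np.zeros(2)                           # action preferences
baseline = 0.0                                # running average reward (variance reduction)
for _ in range(n_trials):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = float(rng.random() < p_reward[a])
    grad = -pi                                # d log pi(a) / d theta for a softmax policy
    grad[a] += 1.0
    theta += alpha * (r - baseline) * grad    # shift preferences toward rewarded actions
    baseline += 0.05 * (r - baseline)

print("value-based     policy:", np.round(softmax(Q), 2))
print("policy-gradient policy:", np.round(softmax(theta), 2))
```

Both learners end up preferring the richer arm, but only the value-based learner keeps an explicit estimate of each action's expected reward; the policy-gradient learner stores only the behavioral policy itself, which is the sense in which the abstract calls it a lighter-weight model.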
Pages: 114-121
Number of pages: 8