Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail

Cited by: 75

Authors:

Vasilaki, Eleni [1,2]

Fremaux, Nicolas [1]

Urbanczik, Robert [3]

Senn, Walter [3]

Gerstner, Wulfram [1]

Affiliations:

[1] Ecole Polytech Fed Lausanne, Lab Computat Neurosci, CH-1015 Lausanne, Switzerland

[2] Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England

[3] Univ Bern, Dept Physiol, CH-3012 Bern, Switzerland

Source:

PLOS COMPUTATIONAL BIOLOGY | 2009 / Volume 5 / Issue 12

Keywords:

NEOCORTICAL PYRAMIDAL NEURONS; LONG-TERM POTENTIATION; SYNAPTIC PLASTICITY; DEPENDENT PLASTICITY; MODEL; PLACE; NAVIGATION; MEMORY; REWARD; HIPPOCAMPUS;

DOI:

10.1371/journal.pcbi.1000586

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Changes of synaptic connections between neurons are thought to be the physiological basis of learning. These changes can be gated by neuromodulators that encode the presence of reward. We study a family of reward-modulated synaptic learning rules for spiking neurons on a learning task in continuous space inspired by the Morris water maze. The synaptic update rule modifies the release probability of synaptic transmission and depends on the timing of presynaptic spike arrival, postsynaptic action potentials, as well as the membrane potential of the postsynaptic neuron. The family of learning rules includes an optimal rule derived from policy gradient methods as well as reward-modulated Hebbian learning. The synaptic update rule is implemented in a population of spiking neurons using a network architecture that combines feedforward input with lateral connections. Actions are represented by a population of hypothetical action cells with strong Mexican-hat connectivity and are read out at theta frequency. We show that in this architecture, a standard policy gradient rule fails to solve the Morris water maze task, whereas a variant with a Hebbian bias can learn the task within 20 trials, consistent with experiments. This result does not depend on implementation details such as the size of the neuronal populations. Our theoretical approach shows how learning new behaviors can be linked to reward-modulated plasticity at the level of single synapses and makes predictions about the voltage and spike-timing dependence of synaptic plasticity and the influence of neuromodulators such as dopamine. It is an important step towards connecting formal theories of reinforcement learning with neuronal and synaptic properties.


Pages: 17

References: 83 in total

