Deep Reinforcement Learning With Modulated Hebbian Plus Q-Network Architecture

Cited by: 14
Authors
Ladosz, Pawel [1 ,2 ]
Ben-Iwhiwhu, Eseoghene [1 ]
Dick, Jeffery [1 ]
Ketz, Nicholas [3 ]
Kolouri, Soheil [3 ,4 ]
Krichmar, Jeffrey L. [5 ,6 ]
Pilly, Praveen K. [3 ]
Soltoggio, Andrea [1 ]
Affiliations
[1] Loughborough Univ, Dept Comp Sci, Loughborough LE11 3TU, Leics, England
[2] Natl Inst Sci & Technol UNIST, Sch Mech & Nucl Engn, Ulsan 44919, South Korea
[3] HRL Labs, Informat & Syst Sci Lab, Malibu, CA 90265 USA
[4] Vanderbilt Univ, Comp Sci Dept, Nashville, TN 37235 USA
[5] Univ Calif Irvine, Dept Cognit Sci, Irvine, CA 92697 USA
[6] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92697 USA
Funding
National Research Foundation of Singapore;
Keywords
Reinforcement learning; History; Markov processes; Benchmark testing; Delays; Decision making; Correlation; Biologically inspired learning; decision-making; deep reinforcement learning (RL); partially observable Markov decision process (POMDP); DISTAL REWARD PROBLEM;
DOI
10.1109/TNNLS.2021.3110281
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
In this article, we consider a subclass of partially observable Markov decision process (POMDP) problems that we term confounding POMDPs. In these POMDPs, temporal difference (TD)-based reinforcement learning (RL) algorithms struggle because the TD error cannot be easily derived from observations. We solve these problems using a new bio-inspired neural architecture that combines a modulated Hebbian network (MOHN) with a deep Q-network (DQN), which we call the modulated Hebbian plus Q-network architecture (MOHQA). The key idea is to use a Hebbian network with rarely correlated bio-inspired neural traces to bridge temporal delays between actions and rewards when confounding observations and sparse rewards result in inaccurate TD errors. In MOHQA, the DQN learns low-level features and control, while the MOHN contributes to high-level decisions by associating rewards with past states and actions. The proposed architecture thus combines two modules with significantly different learning algorithms, a Hebbian associative network and a classical DQN pipeline, and exploits the advantages of both. Simulations on a set of POMDPs and on the Malmo environment show that the proposed algorithm improved on DQN's results and even outperformed control tests with advantage actor-critic (A2C), quantile regression DQN with long short-term memory (QRDQN + LSTM), Monte Carlo policy gradient (REINFORCE), and aggregated memory for reinforcement learning (AMRL) algorithms on the most difficult POMDPs with confounding stimuli and sparse rewards.
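The abstract describes the MOHQA mechanism only at a high level. The following minimal Python/NumPy sketch illustrates the core idea of pairing a reward-modulated Hebbian layer (decaying eligibility traces consolidated only when a delayed reward arrives) with a TD-trained Q-value pathway whose outputs are combined for action selection. It is not the authors' implementation: the class names, layer sizes, learning rates, trace decay, and the additive combination of the two pathways are illustrative assumptions.

```python
# Minimal illustrative sketch of the MOHQA idea from the abstract: a
# reward-modulated Hebbian layer with decaying eligibility traces alongside a
# TD-trained Q-value pathway. NOT the authors' implementation; all sizes,
# constants, and the pathway-combination rule are assumptions.
import numpy as np


class ModulatedHebbianLayer:
    """Hebbian associative layer with a decaying eligibility trace.

    Co-activity of inputs and outputs is stored in a trace; a sparse, possibly
    delayed reward later consolidates the trace into the weights, bridging the
    temporal gap between actions and rewards.
    """

    def __init__(self, n_in, n_out, lr=0.01, trace_decay=0.95, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(n_out, n_in))
        self.trace = np.zeros_like(self.W)
        self.lr = lr
        self.trace_decay = trace_decay

    def forward(self, x):
        y = np.tanh(self.W @ x)
        # Decaying Hebbian eligibility trace: outer product of post- and
        # pre-synaptic activity, accumulated over time.
        self.trace = self.trace_decay * self.trace + np.outer(y, x)
        return y

    def modulate(self, reward):
        # Reward-modulated consolidation: weights change only when a reward
        # signal arrives, in proportion to the stored trace.
        self.W += self.lr * reward * self.trace


class LinearQPathway:
    """Stand-in for the DQN pathway: a linear Q-function with a TD(0) update."""

    def __init__(self, n_in, n_actions, lr=0.01, gamma=0.99, seed=1):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(n_actions, n_in))
        self.lr = lr
        self.gamma = gamma

    def q_values(self, x):
        return self.W @ x

    def td_update(self, x, action, reward, x_next, done):
        target = reward + (0.0 if done else self.gamma * np.max(self.q_values(x_next)))
        td_error = target - self.q_values(x)[action]
        self.W[action] += self.lr * td_error * x
        return td_error


def select_action(hebb, qnet, obs, epsilon=0.1, rng=None):
    """Combine the two pathways: Hebbian 'advice' is added to the Q-values."""
    if rng is None:
        rng = np.random.default_rng()
    q = qnet.q_values(obs)
    advice = hebb.forward(obs)  # hebb is sized with n_out == n_actions here
    scores = q + advice
    if rng.random() < epsilon:
        return int(rng.integers(len(scores)))
    return int(np.argmax(scores))
```

In a training loop under these assumptions, one would call select_action and qnet.td_update on every step, and hebb.modulate whenever a nonzero reward arrives, so that the Hebbian pathway can credit state-action pairs that preceded the reward even when the TD error is uninformative.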
Pages: 2045-2056
Number of pages: 12