An Improved N-Step Value Gradient Learning Adaptive Dynamic Programming Algorithm for Online Learning

Cited by: 28
Authors
Al-Dabooni, Seaar [1,2]
Wunsch, Donald C., II [3]
Affiliations
[1] Missouri Univ Sci & Technol, ACIL, Rolla, MO 65401 USA
[2] Basra Oil Co, Basra 61030, Iraq
[3] Missouri Univ Sci & Technol, Dept Elect & Comp Engn, ACIL, Rolla, MO 65401 USA
Funding
U.S. National Science Foundation;
Keywords
Adaptive dynamic programming (ADP); convergence analysis; eligibility traces; online learning; reinforcement learning; temporal difference (TD); value gradient learning (VGL); stability analysis; nonlinear systems; tracking control; continuous time; HJB solution; backpropagation; representation; approximation; architecture
DOI
10.1109/TNNLS.2019.2919338
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In problems with complex dynamics and challenging state spaces, the dual heuristic programming (DHP) algorithm has been shown, theoretically and experimentally, to perform well. DHP was recently extended by an approach called value gradient learning (VGL), inspired by a version of temporal difference (TD) learning that uses eligibility traces. Eligibility traces apply an exponential decay to older observations, governed by a decay parameter λ; this approach is known as TD(λ), and its DHP extension is known as VGL(λ), where VGL(0) is identical to DHP. VGL exhibits convergence and other desirable properties, but it is primarily useful for batch learning, because online learning requires an eligibility-trace workspace matrix that the batch-learning version does not need. Since online learning is desirable for many applications, it is important to remove this computational and memory impediment. This paper introduces a dual-critic version of VGL, called N-step VGL (NSVGL), that does not need the eligibility-trace workspace matrix, thereby allowing online learning. Furthermore, the combination of critic networks allows NSVGL to learn faster. The first critic is similar to DHP and is adapted based on TD(0) learning, while the second critic is adapted based on the gradient of n-step TD(λ) learning. Both networks are combined to train an actor network. By mixing current information with event history, the combined feedback signals from the two critic networks reach an optimal decision faster than traditional adaptive dynamic programming (ADP). Convergence proofs are provided: the gradients of the one- and n-step value functions are monotonically nondecreasing and converge to the optimum. Two simulation case studies are presented to show the superior performance of NSVGL.
Pages: 1155-1169
Number of pages: 15
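To make the dual-critic structure described in the abstract concrete, the following is a minimal, illustrative Python sketch of combining a one-step TD(0) critic with an n-step critic to train a single actor on a toy linear plant. It is not the authors' algorithm: NSVGL adapts critic networks that approximate the gradient of the value function (in the DHP/VGL sense) and uses model derivatives in its updates, whereas this sketch uses plain linear value critics. The plant, cost function, feature map, learning rates, and all variable names are assumptions made only for illustration.

import numpy as np

# Illustrative sketch only (not the authors' implementation): two critics,
# one updated with a one-step TD(0) target and one with an n-step TD target,
# jointly drive a linear actor on an assumed toy scalar plant. The real NSVGL
# critics approximate value-function gradients (DHP/VGL style); plain value
# critics are used here purely to show how the two feedback signals combine.

rng = np.random.default_rng(0)

gamma, n_steps = 0.95, 4            # discount factor; horizon of the n-step critic
lr_c1, lr_c2, lr_a = 0.05, 0.05, 0.01

w1 = np.zeros(2)                    # critic 1 weights (TD(0) role)
w2 = np.zeros(2)                    # critic 2 weights (n-step role)
wa = np.zeros(2)                    # actor weights (linear state feedback)

def feats(x):                       # simple linear features [x, 1]
    return np.array([x, 1.0])

def plant(x, u):                    # assumed toy linear plant
    return 0.9 * x + 0.1 * u

def cost(x, u):                     # quadratic stage cost to be minimized
    return x**2 + 0.1 * u**2

x = 1.0
window = []                         # last n transitions as (features, stage cost)

for t in range(2000):
    phi = feats(x)
    u = float(wa @ phi) + 0.01 * rng.standard_normal()   # exploratory action
    r = cost(x, u)
    x_next = plant(x, u)
    phi_next = feats(x_next)

    # Critic 1: one-step TD(0) update.
    td0 = r + gamma * (w1 @ phi_next) - (w1 @ phi)
    w1 += lr_c1 * td0 * phi

    # Critic 2: n-step TD update, once the window holds n transitions.
    window.append((phi, r))
    if len(window) == n_steps:
        phi0 = window[0][0]
        G = sum((gamma**k) * rk for k, (_, rk) in enumerate(window))
        G += (gamma**n_steps) * (w2 @ phi_next)           # bootstrap after n steps
        w2 += lr_c2 * (G - (w2 @ phi0)) * phi0
        window.pop(0)

    # Actor: descend the mixed feedback from both critics with respect to u.
    # For V(x) = w @ [x, 1], dV/dx = w[0]; d(x_next)/du = 0.1; d(cost)/du = 0.2*u.
    dJ_du = 0.2 * u + gamma * 0.5 * (w1[0] + w2[0]) * 0.1
    wa -= lr_a * dJ_du * phi        # du/dwa = phi for the linear actor

    x = x_next

print("critic1:", w1, " critic2:", w2, " actor:", wa)

The averaging of the two critic signals in dJ_du is one simple way to mix the one-step and n-step feedback; the paper's actual weighting and gradient-based critic targets differ and should be taken from the article itself.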