The Boundedness Conditions for Model-Free HDP(lambda)

Cited by: 14
Authors
Al-Dabooni, Seaar [1 ]
Wunsch, Donald [1 ]
Affiliations
[1] Missouri Univ Sci & Technol, Appl Computat Intelligence Lab, Dept Elect & Comp Engn, Rolla, MO 65401 USA
Keywords
lambda-return; action dependent (AD); approximate dynamic programming (ADP); heuristic dynamic programming (HDP); Lyapunov stability; model free; uniformly ultimately bounded (UUB); BACKPROPAGATION; REPRESENTATION;
DOI
10.1109/TNNLS.2018.2875870
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper provides a stability analysis for a model-free, action-dependent heuristic dynamic programming (HDP) approach with an eligibility-trace long-term prediction parameter (lambda). HDP(lambda) learns from more than one future reward. Eligibility traces have long been popular in Q-learning, and this paper proves and demonstrates that they are also worthwhile with HDP. We prove the uniformly ultimately bounded (UUB) property of HDP(lambda) under certain conditions. Previous works present a UUB proof for traditional HDP [HDP(lambda = 0)]; we extend that proof to include the lambda parameter. Using Lyapunov stability, we establish the boundedness of the estimation errors of the critic and actor neural networks, as well as conditions on the learning-rate parameters. Three case studies demonstrate the effectiveness of HDP(lambda). The first considers the trajectories of a nonlinear system with an internal reinforcement signal, comparing the results with HDP and traditional temporal difference [TD(lambda)] for different lambda values. The second is a single-link inverted pendulum, where HDP(lambda) is compared with regular HDP under different levels of noise. The third is a 3-D maze navigation benchmark, in which HDP(lambda) is compared with state-action-reward-state-action (SARSA), Q(lambda), and HDP. All simulation results show that HDP(lambda) is competitive; thus, the approach is not only UUB but also advantageous compared with traditional HDP.
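For context, a minimal sketch of the lambda-return that eligibility-trace methods such as HDP(lambda) build on; the notation here is generic (r for the reward, gamma for the discount factor, \hat{J} for the critic's value estimate, x_t for the state) and is not taken verbatim from the paper:

R_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}, \qquad R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1} + \gamma^{n} \hat{J}(x_{t+n}).

Setting lambda = 0 recovers the one-step HDP target r_{t+1} + gamma * \hat{J}(x_{t+1}), while lambda approaching 1 weights the full return, which is the sense in which HDP(lambda) "learns from more than one future reward."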
Pages: 1928-1942
Page count: 15