An Equivalence Between Adaptive Dynamic Programming With a Critic and Backpropagation Through Time

Cited by: 32
Authors
Fairbank, Michael [1 ]
Alonso, Eduardo [1 ]
Prokhorov, Danil [2 ]
Affiliations
[1] City Univ London, Sch Informat, Dept Comp Sci, London EC1V 0HB, England
[2] Toyota Res Inst NA, Ann Arbor, MI 48103 USA
Keywords
Adaptive dynamic programming (ADP); backpropagation through time; dual heuristic programming (DHP); neural networks; value-gradient learning; convergence; designs
DOI
10.1109/TNNLS.2013.2271778
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which is designed to learn a critic function when using learned model functions of the environment. DHP is designed for optimizing control problems in large and continuous state spaces. We extend DHP into a new algorithm that we call Value-Gradient Learning, VGL(λ), and prove equivalence of an instance of the new algorithm to Backpropagation Through Time for Control with a greedy policy. Not only does this equivalence provide a link between these two different approaches, but it also enables our variant of DHP to have guaranteed convergence, under certain smoothness conditions and a greedy policy, when using a general smooth nonlinear function approximator for the critic. We consider several experimental scenarios, including some that prove divergence of DHP under a greedy policy, in contrast to our provably convergent algorithm.
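To give a rough feel for the value-gradient idea the abstract refers to, here is a minimal sketch of a DHP-style critic update. Everything concrete in it is assumed for illustration, not taken from the paper: a one-dimensional linear model, a quadratic reward, a linear critic G(s) = w·s, and a fixed action (no greedy policy). The core mechanic it shows is that the critic approximates the value gradient dJ/ds, and its training target comes from differentiating the one-step Bellman backup through the learned model.

```python
import numpy as np

np.random.seed(0)

gamma = 0.9    # discount factor
alpha = 0.1    # critic learning rate
w = 0.0        # weight of the assumed linear critic G(s) = w * s

def model(s, a):
    # Assumed (hypothetical) linear dynamics: s' = f(s, a) = 0.8*s + a
    return 0.8 * s + a

def reward_grad(s, a):
    # d/ds of the assumed quadratic reward r(s, a) = -(s**2 + a**2)
    return -2.0 * s

def critic(s):
    return w * s   # approximates the value gradient dJ/ds

for _ in range(1000):
    s = np.random.uniform(-1.0, 1.0)
    a = 0.0                      # fixed action for this sketch
    s_next = model(s, a)
    # DHP-style target: differentiate r + gamma * J(f(s, a)) w.r.t. s,
    # using the model Jacobian df/ds = 0.8 of the assumed dynamics.
    target = reward_grad(s, a) + gamma * 0.8 * critic(s_next)
    # Gradient step on the squared critic error (target - G(s))**2 / 2
    w += alpha * (target - critic(s)) * s
```

For this toy system the update has the fixed point w = -2/(1 - gamma*0.8**2) ≈ -4.72, to which the loop converges; the actual DHP and VGL(λ) algorithms in the paper use neural-network critics and a (greedy) policy rather than this fixed-action scalar case.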
Pages: 2088-2100
Page count: 13