An Equivalence Between Adaptive Dynamic Programming With a Critic and Backpropagation Through Time

Cited by: 32
Authors
Fairbank, Michael [1 ]
Alonso, Eduardo [1 ]
Prokhorov, Danil [2 ]
Affiliations
[1] City Univ London, Sch Informat, Dept Comp Sci, London EC1V 0HB, England
[2] Toyota Res Inst NA, Ann Arbor, MI 48103 USA
Keywords
Adaptive dynamic programming (ADP); backpropagation through time; dual heuristic programming (DHP); neural networks; value-gradient learning; convergence; designs
DOI
10.1109/TNNLS.2013.2271778
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which is designed to learn a critic function when using learned model functions of the environment. DHP is designed for optimizing control problems in large and continuous state spaces. We extend DHP into a new algorithm that we call Value-Gradient Learning, VGL(λ), and prove equivalence of an instance of the new algorithm to Backpropagation Through Time for Control with a greedy policy. Not only does this equivalence provide a link between these two different approaches, but it also enables our variant of DHP to have guaranteed convergence, under certain smoothness conditions and a greedy policy, when using a general smooth nonlinear function approximator for the critic. We consider several experimental scenarios, including some that prove divergence of DHP under a greedy policy, in contrast to our provably convergent algorithm.
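To give a rough feel for the value-gradient idea the abstract refers to, here is a minimal sketch of a DHP-style critic update. Everything concrete in it is assumed for illustration, not taken from the paper: a one-dimensional linear model, a quadratic reward, a linear critic G(s) = w·s, and a fixed action (no greedy policy). The core mechanic it shows is that the critic approximates the value gradient dJ/ds, and its training target comes from differentiating the one-step Bellman backup through the learned model.

```python
import numpy as np

np.random.seed(0)

gamma = 0.9    # discount factor
alpha = 0.1    # critic learning rate
w = 0.0        # weight of the assumed linear critic G(s) = w * s

def model(s, a):
    # Assumed (hypothetical) linear dynamics: s' = f(s, a) = 0.8*s + a
    return 0.8 * s + a

def reward_grad(s, a):
    # d/ds of the assumed quadratic reward r(s, a) = -(s**2 + a**2)
    return -2.0 * s

def critic(s):
    return w * s   # approximates the value gradient dJ/ds

for _ in range(1000):
    s = np.random.uniform(-1.0, 1.0)
    a = 0.0                      # fixed action for this sketch
    s_next = model(s, a)
    # DHP-style target: differentiate r + gamma * J(f(s, a)) w.r.t. s,
    # using the model Jacobian df/ds = 0.8 of the assumed dynamics.
    target = reward_grad(s, a) + gamma * 0.8 * critic(s_next)
    # Gradient step on the squared critic error (target - G(s))**2 / 2
    w += alpha * (target - critic(s)) * s
```

For this toy system the update has the fixed point w = -2/(1 - gamma*0.8**2) ≈ -4.72, to which the loop converges; the actual DHP and VGL(λ) algorithms in the paper use neural-network critics and a (greedy) policy rather than this fixed-action scalar case.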
Pages: 2088-2100
Page count: 13