Gradient compensation traces based temporal difference learning

Cited by: 2
Authors
Wang Bi [1,2]
Li Xuelian [1,2]
Gao Zhiqiang [1,2]
Chen Yang [3]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
[2] Southeast Univ, Key Lab Comp Network & Informat Integrat, Minist Educ, Nanjing 210096, Peoples R China
[3] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing, Peoples R China
Keywords
Reinforcement learning; Eligibility traces; Value iteration; Temporal difference learning
DOI
10.1016/j.neucom.2021.02.042
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
For online updates and data efficiency, forward-view algorithms such as temporal difference learning (TD) and its control variants are transformed into backward views by eligibility traces. Existing research on eligibility traces, such as TD(λ) and true-online TD(λ), mainly focuses on the equivalence between the forward and backward views. However, the choice of λ determines the time scope of credit assignment, and a small λ accelerates the decay of credit over time. This paper presents a different implementation of the backward view named gradient compensation traces (GCT). GCT compensates online for the difference between a bootstrapped gradient estimate and the true gradient, removing the extra decay of credit. Based on GCT, the corresponding temporal difference learning algorithm (gradient compensation TD, GCTD) is proved to converge under certain conditions. The sensitivity of GCTD's hyper-parameter is analyzed on the nonlinear long-corridor and linear random-walk tasks. The proposed algorithm is comparable with true-online TD(λ) on the basic Mountain Car task and outperforms the baselines in the sparse-reward setting. (c) 2021 Elsevier B.V. All rights reserved.
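The abstract does not reproduce the GCTD update itself; for context, the following is a minimal sketch of the standard backward-view TD(λ) with accumulating eligibility traces on a linear random-walk task of the kind mentioned above, illustrating the γλ trace decay that GCT is said to compensate. The 19-state environment, one-hot features, and all parameter values are illustrative assumptions, not taken from the paper.

import numpy as np

# Sketch: backward-view TD(lambda) with accumulating eligibility traces
# on a 19-state random walk with one-hot (tabular) linear features.
# This is the standard baseline the abstract refers to, not GCTD.

N_STATES = 19                 # non-terminal states; terminals lie beyond both ends
GAMMA, LAM, ALPHA = 1.0, 0.9, 0.05   # illustrative hyper-parameters

def phi(s):
    """One-hot feature vector for non-terminal state s."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def run_td_lambda(episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)                    # linear value weights
    for _ in range(episodes):
        s = N_STATES // 2                     # start in the middle state
        z = np.zeros(N_STATES)                # eligibility trace vector
        done = False
        while not done:
            s_next = s + rng.choice([-1, 1])  # unbiased random walk
            done = s_next < 0 or s_next >= N_STATES
            r = 1.0 if s_next >= N_STATES else 0.0      # +1 only at the right terminal
            v_next = 0.0 if done else w @ phi(s_next)   # bootstrapped next-state value
            delta = r + GAMMA * v_next - w @ phi(s)     # TD error
            # Accumulate credit, then decay it by gamma * lambda.
            # This geometric decay is the credit fade that a small
            # lambda amplifies and that GCT aims to compensate.
            z = GAMMA * LAM * z + phi(s)
            w += ALPHA * delta * z
            s = s_next
    return w

if __name__ == "__main__":
    # Learned values approximate P(terminating at the right end | state).
    print(np.round(run_td_lambda(), 3))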
Pages: 221-235
Number of pages: 15