Gradient temporal-difference learning for off-policy evaluation using emphatic weightings

Cited by: 7
Authors
Cao, Jiaqing [1 ]
Liu, Quan [1 ,2 ,3 ]
Zhu, Fei [1 ]
Fu, Qiming [4 ]
Zhong, Shan [5 ]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Prov Key Lab Comp Informat Proc Technol, Suzhou 215006, Peoples R China
[2] Jilin Univ, Key Lab Symbol Computat & Knowledge Engn, Minist Educ, Changchun 130012, Peoples R China
[3] Collaborat Innovat Ctr Novel Software Technol & I, Nanjing 210000, Peoples R China
[4] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[5] Changshu Inst Technol, Sch Comp Sci & Engn, Changshu 215500, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Reinforcement learning; Off-policy evaluation; Temporal-difference learning; Gradient temporal-difference learning; Emphatic approach;
DOI
10.1016/j.ins.2021.08.082
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The problem of off-policy evaluation (OPE) has long been advocated as one of the foremost challenges in reinforcement learning. Gradient-based and emphasis-based temporal-difference (TD) learning algorithms comprise the major part of off-policy TD learning methods. In this work, we investigate the derivation of efficient OPE algorithms from a novel perspective based on the advantages of these two categories. The gradient-based framework is adopted, and the emphatic approach is used to improve convergence performance. We begin by proposing a new analogue of the on-policy objective, called the distribution-correction-based mean square projected Bellman error (DC-MSPBE). The key to the construction of DC-MSPBE is the use of emphatic weightings on the representable subspace of the original MSPBE. Based on this objective function, the emphatic TD with lower-variance gradient correction (ETD-LVC) algorithm is proposed. Under standard off-policy and stochastic approximation conditions, we provide the convergence analysis of the ETD-LVC algorithm in the case of linear function approximation. Further, we generalize the algorithm to nonlinear smooth function approximation. Finally, we empirically demonstrate the improved performance of our ETD-LVC algorithm on off-policy benchmarks. Taken together, we hope that our work can guide the future discovery of a better alternative in the off-policy TD learning algorithm family. (c) 2021 Elsevier Inc. All rights reserved.
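The abstract describes ETD-LVC as a gradient-corrected off-policy TD method whose updates are reweighted by emphatic weightings under linear function approximation. The paper's exact update equations are not reproduced in this record, so the following is only a minimal illustrative sketch: a TDC-style two-timescale update whose increments are scaled by an ETD-style followon/emphasis trace. The function name, the simple emphasis choice, the step sizes alpha and beta, and the transition interface are all assumptions for illustration, not the authors' notation.

```python
# Hypothetical sketch: off-policy linear value evaluation that combines an
# emphatic weighting (followon trace) with a TDC-style gradient-correction
# term, in the spirit of the approach described in the abstract.
import numpy as np

def etd_lvc_like_update(w, h, F, phi, phi_next, reward, rho,
                        gamma=0.99, interest=1.0, alpha=1e-2, beta=1e-1):
    """One sampled transition's update of the primary weights w and the
    auxiliary (gradient-correction) weights h.

    w, h      : parameter vectors with the same dimension as the features
    F         : scalar followon trace carrying the emphatic weighting
    phi, phi_next : feature vectors of the current and next state
    rho       : importance-sampling ratio pi(a|s) / mu(a|s)
    interest  : user-specified interest i(s) in the current state
    """
    # Emphatic weighting: accumulate the discounted followon trace, then
    # take the simplest emphasis choice (assumed lambda = 0 variant).
    F = gamma * rho * F + interest
    M = F

    # Off-policy TD error under linear function approximation.
    delta = reward + gamma * np.dot(w, phi_next) - np.dot(w, phi)

    # TDC-style gradient correction, with both updates scaled by the
    # emphasis M and the importance-sampling ratio rho.
    correction = gamma * phi_next * np.dot(h, phi)
    w = w + alpha * M * rho * (delta * phi - correction)
    h = h + beta * M * rho * (delta - np.dot(h, phi)) * phi

    return w, h, F
```

As in other two-timescale gradient-TD methods, the auxiliary step size beta is typically chosen larger than alpha so that h tracks its target on the faster timescale.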
Pages: 311-330
Number of pages: 20
References (47 in total)
[1] [Anonymous], 2009, Proceedings of the 26th Annual International Conference on Machine Learning.
[2] Baird L., 1995, Proceedings of the Twelfth International Conference on Machine Learning, p. 30.
[3] Borkar V.S., Meyn S.P. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 2000, 38(2): 447-469.
[4] Brandfonbrener D., 2020, International Conference on Learning Representations.
[5] Cai Q., 2019, Advances in Neural Information Processing Systems, p. 11312.
[6] Dann C., 2014, Journal of Machine Learning Research, 15: 809.
[7] De Asis K., 2020, Proceedings of the 34th AAAI Conference on Artificial Intelligence, p. 9337.
[8] Diddigi R.B., Kamanchi C., Bhatnagar S. A convergent off-policy temporal difference algorithm. ECAI 2020: 24th European Conference on Artificial Intelligence, 2020, 325: 1103-1110.
[9] Ghiassian S., 2020, International Conference on Machine Learning, p. 3524.
[10] Ghiassian S., 2018, arXiv preprint.