Gradient temporal-difference learning for off-policy evaluation using emphatic weightings

Cited by: 7
Authors
Cao, Jiaqing [1 ]
Liu, Quan [1 ,2 ,3 ]
Zhu, Fei [1 ]
Fu, Qiming [4 ]
Zhong, Shan [5 ]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Prov Key Lab Comp Informat Proc Technol, Suzhou 215006, Peoples R China
[2] Jilin Univ, Key Lab Symbol Computat & Knowledge Engn, Minist Educ, Changchun 130012, Peoples R China
[3] Collaborat Innovat Ctr Novel Software Technol & I, Nanjing 210000, Peoples R China
[4] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[5] Changshu Inst Technol, Sch Comp Sci & Engn, Changshu 215500, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Reinforcement learning; Off-policy evaluation; Temporal-difference learning; Gradient temporal-difference learning; Emphatic approach;
DOI
10.1016/j.ins.2021.08.082
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The problem of off-policy evaluation (OPE) has long been advocated as one of the foremost challenges in reinforcement learning. Gradient-based and emphasis-based temporal-difference (TD) learning algorithms comprise the major part of off-policy TD learning methods. In this work, we investigate the derivation of efficient OPE algorithms from a novel perspective based on the advantages of these two categories. The gradient-based framework is adopted, and the emphatic approach is used to improve convergence performance. We begin by proposing a new analogue of the on-policy objective, called the distribution-correction-based mean square projected Bellman error (DC-MSPBE). The key to the construction of DC-MSPBE is the use of emphatic weightings on the representable subspace of the original MSPBE. Based on this objective function, the emphatic TD with lower-variance gradient correction (ETD-LVC) algorithm is proposed. Under standard off-policy and stochastic approximation conditions, we provide the convergence analysis of the ETD-LVC algorithm in the case of linear function approximation. Further, we generalize the algorithm to nonlinear smooth function approximation. Finally, we empirically demonstrate the improved performance of our ETD-LVC algorithm on off-policy benchmarks. Taken together, we hope that our work can guide the future discovery of a better alternative in the off-policy TD learning algorithm family. (c) 2021 Elsevier Inc. All rights reserved.
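The abstract describes ETD-LVC as a gradient-corrected off-policy TD method whose updates are reweighted by emphatic weightings under linear function approximation. The paper's exact update equations are not reproduced in this record, so the following is only a minimal illustrative sketch: a TDC-style two-timescale update whose increments are scaled by an ETD-style followon/emphasis trace. The function name, the simple emphasis choice, the step sizes alpha and beta, and the transition interface are all assumptions for illustration, not the authors' notation.

```python
# Hypothetical sketch: off-policy linear value evaluation that combines an
# emphatic weighting (followon trace) with a TDC-style gradient-correction
# term, in the spirit of the approach described in the abstract.
import numpy as np

def etd_lvc_like_update(w, h, F, phi, phi_next, reward, rho,
                        gamma=0.99, interest=1.0, alpha=1e-2, beta=1e-1):
    """One sampled transition's update of the primary weights w and the
    auxiliary (gradient-correction) weights h.

    w, h      : parameter vectors with the same dimension as the features
    F         : scalar followon trace carrying the emphatic weighting
    phi, phi_next : feature vectors of the current and next state
    rho       : importance-sampling ratio pi(a|s) / mu(a|s)
    interest  : user-specified interest i(s) in the current state
    """
    # Emphatic weighting: accumulate the discounted followon trace, then
    # take the simplest emphasis choice (assumed lambda = 0 variant).
    F = gamma * rho * F + interest
    M = F

    # Off-policy TD error under linear function approximation.
    delta = reward + gamma * np.dot(w, phi_next) - np.dot(w, phi)

    # TDC-style gradient correction, with both updates scaled by the
    # emphasis M and the importance-sampling ratio rho.
    correction = gamma * phi_next * np.dot(h, phi)
    w = w + alpha * M * rho * (delta * phi - correction)
    h = h + beta * M * rho * (delta - np.dot(h, phi)) * phi

    return w, h, F
```

As in other two-timescale gradient-TD methods, the auxiliary step size beta is typically chosen larger than alpha so that h tracks its target on the faster timescale.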
Pages: 311-330
Number of pages: 20
References (47 in total)
[1] [Anonymous], 2009, Proceedings of the 26th Annual International Conference on Machine Learning.
[2] Baird L., 1995, Proceedings of the Twelfth International Conference on Machine Learning, p. 30.
[3] Borkar V.S., Meyn S.P. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 2000, 38(2): 447-469.
[4] Brandfonbrener D., 2020, International Conference on Learning Representations.
[5] Cai Q., 2019, Advances in Neural Information Processing Systems, p. 11312.
[6] Dann C., 2014, Journal of Machine Learning Research, 15: 809.
[7] De Asis K., 2020, Proceedings of the 34th AAAI Conference on Artificial Intelligence, p. 9337.
[8] Diddigi R.B., Kamanchi C., Bhatnagar S. A convergent off-policy temporal difference algorithm. ECAI 2020: 24th European Conference on Artificial Intelligence, 2020, 325: 1103-1110.
[9] Ghiassian S., 2020, International Conference on Machine Learning, p. 3524.
[10] Ghiassian S., 2018, arXiv preprint.