Temporal-difference emphasis learning with regularized correction for off-policy evaluation and control

Cited: 0
Authors
Cao, Jiaqing [1]
Liu, Quan [1]
Wu, Lan [1]
Fu, Qiming [2]
Zhong, Shan [3]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[3] Changshu Inst Technol, Sch Comp Sci & Engn, Changshu 215500, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Reinforcement learning; Off-policy learning; Emphatic approach; Gradient temporal-difference learning; Gradient emphasis learning;
DOI
10.1007/s10489-023-04579-4
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Off-policy learning, where the goal is to learn about a policy of interest while following a different behavior policy, constitutes an important class of reinforcement learning problems. Emphatic temporal-difference (TD) learning is a pioneering off-policy reinforcement learning method built on the followon trace. The gradient emphasis learning (GEM) algorithm was recently proposed to fix, from the perspective of stochastic approximation, the unbounded variance and the large emphasis approximation error introduced by the followon trace. This approach, however, is limited to a single GTD2-style update and does not consider the update rules of the other gradient-TD algorithms. Overall, how to better learn the emphasis for off-policy learning remains an open question. In this paper, we rethink GEM and propose a novel two-time-scale algorithm, TD emphasis learning with gradient correction (TDEC), to learn the true emphasis. We further regularize the update to the secondary learning process of TDEC, obtaining our final TD emphasis learning with regularized correction (TDERC) algorithm. We then apply the emphasis estimated by the proposed emphasis learning algorithms to the value estimation gradient and the policy gradient, respectively, yielding the corresponding emphatic TD variants for off-policy evaluation and actor-critic algorithms for off-policy control. Finally, we empirically demonstrate the advantage of the proposed algorithms on a small domain as well as on challenging MuJoCo robot simulation tasks. Taken together, we hope that our work provides new insights into the development of better alternatives in the family of off-policy emphatic algorithms.
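The abstract describes TDEC and TDERC only in words; their update equations are not part of this record. Below is a minimal illustrative sketch, not the authors' published rules, assuming a linear emphasis approximation m(s) ≈ θᵀφ(s), a reverse Bellman target i(S_t) + γ ρ_{t-1} m(S_{t-1}) for the expected followon trace, a TDC-style gradient correction on the primary weights, and a TDRC-style ℓ2 penalty on the secondary weights. Every name and the exact form of each rule here is an assumption:

```python
import numpy as np

def tderc_emphasis_update(theta, w, phi_prev, phi, rho_prev, interest,
                          gamma, alpha, beta, zeta):
    """One step of a hypothetical TDERC-style emphasis update (a sketch,
    not the authors' exact algorithm).

    theta    -- primary weights; emphasis estimate m(s) ~= theta @ phi(s)
    w        -- secondary (correction) weights, on the second time scale
    phi_prev -- feature vector of the previous state S_{t-1}
    phi      -- feature vector of the current state S_t
    rho_prev -- importance-sampling ratio at time t-1
    interest -- interest i(S_t) assigned to the current state
    zeta     -- l2-regularization strength on w (the assumed "RC" part)
    """
    m_prev = theta @ phi_prev
    m = theta @ phi
    # Reverse-TD error: the emphasis bootstraps backward in time,
    # m(S_t) ~ i(S_t) + gamma * rho_{t-1} * m(S_{t-1}).
    delta = interest + gamma * rho_prev * m_prev - m
    # Primary update: TDC-style corrected semi-gradient on theta.
    theta = theta + alpha * (delta * phi
                             - gamma * rho_prev * (w @ phi) * phi_prev)
    # Secondary update with an l2 penalty on w, mirroring TDRC.
    w = w + beta * ((delta - w @ phi) * phi - zeta * w)
    return theta, w
```

Under these assumptions, setting zeta = 0 would recover an unregularized TDEC-style update, and the two step sizes alpha and beta run on separate time scales, as in other gradient-TD methods.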
Pages: 20917 - 20937
Number of pages: 21
Related Papers
50 records in total
  • [41] Off-policy and on-policy reinforcement learning with the Tsetlin machine
    Gorji, Saeed Rahimi
    Granmo, Ole-Christoffer
    APPLIED INTELLIGENCE, 2023, 53 (08) : 8596 - 8613
  • [42] Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient
    Tosatto, Samuele
    Carvalho, Joao
    Peters, Jan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (10) : 5996 - 6010
  • [43] Debiased Off-Policy Evaluation for Recommendation Systems
    Narita, Yusuke
    Yasui, Shota
    Yata, Kohei
15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021), 2021 : 372 - 379
  • [44] Hyperparameter Tuning of an Off-Policy Reinforcement Learning Algorithm for H∞ Tracking Control
    Farahmandi, Alireza
    Reitz, Brian
    Debord, Mark
    Philbrick, Douglas
    Estabridis, Katia
    Hewer, Gary
LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 211, 2023
  • [46] On the asymptotic behavior of a constant stepsize temporal-difference learning algorithm
    Tadic, A
    COMPUTATIONAL LEARNING THEORY, 1999, 1572 : 126 - 137
  • [47] Implementing Temporal-Difference Learning with the Scaled Conjugate Gradient Algorithm
    Falas, Tasos
    Stafylopatis, Andreas
    NEURAL PROCESSING LETTERS, 2005, 22 : 361 - 375
  • [48] Using temporal-difference learning for multi-agent bargaining
    Huang, Shiu-li
    Lin, Fu-ren
    ELECTRONIC COMMERCE RESEARCH AND APPLICATIONS, 2008, 7 (04) : 432 - 442
  • [49] Temporal-Difference Learning: An Online Support Vector Regression Approach
    Teixeira, Hugo Tanzarella
    Bottura, Celso Pascoli
    ICIMCO 2015 PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON INFORMATICS IN CONTROL, AUTOMATION AND ROBOTICS, VOL. 1, 2015, : 318 - 323
  • [50] Correlation minimizing replay memory in temporal-difference reinforcement learning
    Ramicic, Mirza
Bonarini, Andrea
    NEUROCOMPUTING, 2020, 393 : 91 - 100