Temporal-difference emphasis learning with regularized correction for off-policy evaluation and control

Cited by: 0
Authors
Cao, Jiaqing [1]
Liu, Quan [1]
Wu, Lan [1]
Fu, Qiming [2]
Zhong, Shan [3]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[3] Changshu Inst Technol, Sch Comp Sci & Engn, Changshu 215500, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Reinforcement learning; Off-policy learning; Emphatic approach; Gradient temporal-difference learning; Gradient emphasis learning;
DOI
10.1007/s10489-023-04579-4
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Off-policy learning, where the goal is to learn about a policy of interest while following a different behavior policy, constitutes an important class of reinforcement learning problems. Emphatic temporal-difference (TD) learning is a pioneering off-policy reinforcement learning method that relies on the followon trace. The gradient emphasis learning (GEM) algorithm was recently proposed, from the perspective of stochastic approximation, to fix the unbounded variance and large emphasis approximation error introduced by the followon trace. GEM, however, is limited to a single gradient-TD2-style update and does not consider the update rules of other GTD algorithms. How best to learn the emphasis for off-policy learning therefore remains an open question. In this paper, we rethink GEM and introduce a novel two-time-scale algorithm called TD emphasis learning with gradient correction (TDEC) to learn the true emphasis. Further, we regularize the update to the secondary learning process of TDEC and obtain our final TD emphasis learning with regularized correction (TDERC) algorithm. We then apply the emphasis estimated by the proposed emphasis learning algorithms to the value estimation gradient and the policy gradient, respectively, yielding the corresponding emphatic TD variants for off-policy evaluation and actor-critic algorithms for off-policy control. Finally, we empirically demonstrate the advantage of the proposed algorithms on a small domain as well as on challenging MuJoCo robot simulation tasks. Taken together, we hope that our work can provide new insights into the development of a better alternative in the family of off-policy emphatic algorithms.
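For readers unfamiliar with the followon trace mentioned in the abstract, the following is a minimal sketch of the textbook emphatic TD(0) recursion with linear function approximation. It is not the TDEC or TDERC algorithm, whose update rules are not given in this record. The function names (`followon_trace`, `etd0_step`), the argument layout, and the default interest of 1 for every state are assumptions made for illustration only.

```python
def followon_trace(rhos, gamma, interests=None):
    """Return F_0..F_{T-1}, where F_t = gamma * rho_{t-1} * F_{t-1} + i_t and F_0 = i_0.

    rhos[t] is the importance sampling ratio pi(a_t|s_t) / mu(a_t|s_t).
    interests[t] is the interest in state s_t (assumed 1.0 if not given).
    """
    if interests is None:
        interests = [1.0] * len(rhos)
    trace = []
    prev_f, prev_rho = 0.0, 0.0  # makes the first step reduce to F_0 = i_0
    for rho, interest in zip(rhos, interests):
        f = gamma * prev_rho * prev_f + interest
        trace.append(f)
        prev_f, prev_rho = f, rho
    return trace


def etd0_step(theta, phi, phi_next, reward, rho, f, gamma, alpha):
    """One emphatic TD(0) update: theta + alpha * F_t * rho_t * delta_t * phi_t."""
    v = sum(w * x for w, x in zip(theta, phi))
    v_next = sum(w * x for w, x in zip(theta, phi_next))
    delta = reward + gamma * v_next - v  # TD error
    return [w + alpha * f * rho * delta * x for w, x in zip(theta, phi)]


# Usage: three on-policy steps (rho = 1) with gamma = 0.5.
trace = followon_trace([1.0, 1.0, 1.0], gamma=0.5)
theta = etd0_step([0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  reward=1.0, rho=1.0, f=trace[1], gamma=0.5, alpha=0.1)
```

Because the trace multiplies successive importance sampling ratios, its variance can grow without bound, which is the problem the abstract says GEM and the proposed algorithms address by learning the emphasis instead.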
Pages: 20917-20937
Page count: 21
Related papers
50 in total
  • [31] Off-policy reinforcement learning algorithm for robust optimal control of uncertain nonlinear systems
    Amirparast, Ali
    Kamal Hosseini Sani, S.
    INTERNATIONAL JOURNAL OF ROBUST AND NONLINEAR CONTROL, 2024, 34 (08) : 5419 - 5437
  • [32] Enhanced Strategies for Off-Policy Reinforcement Learning Algorithms in HVAC Control
    Chen, Zhe
    Jia, Qingshan
    2024 14TH ASIAN CONTROL CONFERENCE, ASCC 2024, 2024, : 1691 - 1696
  • [33] On the convergence of temporal-difference learning with linear function approximation
    Tadic, V
    MACHINE LEARNING, 2001, 42 (03) : 241 - 267
  • [34] On the Convergence of Temporal-Difference Learning with Linear Function Approximation
    Vladislav Tadić
    Machine Learning, 2001, 42 : 241 - 267
  • [35] Off-Policy Learning-to-Bid with AuctionGym
    Jeunen, Olivier
    Murphy, Sean
    Allison, Ben
    PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023, : 4219 - 4228
  • [36] Model-free H∞ control of Itô stochastic system via off-policy reinforcement learning
    Zhang, Weihai
    Guo, Jing
    Jiang, Xiushan
    AUTOMATICA, 2025, 174
  • [37] Off-Policy Prediction Learning: An Empirical Study of Online Algorithms
    Ghiassian, Sina
    Rafiee, Banafsheh
    Sutton, Richard S.
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, : 1 - 15
  • [38] Off-policy correction algorithm for double Q network based on deep reinforcement learning
    Zhang, Qingbo
    Liu, Manlu
    Wang, Heng
    Qian, Weimin
    Zhang, Xinglang
    IET CYBER-SYSTEMS AND ROBOTICS, 2023, 5 (04)
  • [39] Off-Policy Evaluation in Doubly Inhomogeneous Environments
    Bian, Zeyu
    Shi, Chengchun
    Qi, Zhengling
    Wang, Lan
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2024,
  • [40] Sequential Search with Off-Policy Reinforcement Learning
    Miao, Dadong
    Wang, Yanan
    Tang, Guoyu
    Liu, Lin
    Xu, Sulong
    Long, Bo
    Xiao, Yun
    Wu, Lingfei
    Jiang, Yunjiang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 4006 - 4015