Online Attentive Kernel-Based Off-Policy Temporal Difference Learning

Cited by: 0
Authors
Yang, Shangdong [1 ]
Zhang, Shuaiqiang [1 ]
Chen, Xingguo [1 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Sch Comp Sci, Nanjing 210023, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Iss. 23
Funding
National Natural Science Foundation of China;
Keywords
online attentive learning; kernel-based methods; reinforcement learning; off-policy temporal difference learning; two-timescale analysis; NEURAL-NETWORKS; STOCHASTIC-APPROXIMATION;
DOI
10.3390/app142311114
Chinese Library Classification
O6 [Chemistry];
Subject Classification Code
0703;
Abstract
Temporal difference (TD) learning is a powerful framework for value function approximation in reinforcement learning. However, standard TD methods often struggle with feature representation and with off-policy learning. In this paper, we propose a novel framework, online attentive kernel-based off-policy TD learning, and, by combining it with well-known algorithms, derive OAKGTD2, OAKTDC, and OAKETD. The framework uses two-timescale optimization: on the slow timescale, a sparse representation of state features is learned with an online attentive kernel-based method; on the fast timescale, auxiliary variables are used to update the value-function parameters in the off-policy setting. We theoretically prove the convergence of all three algorithms. Through experiments in several standard reinforcement learning environments, we demonstrate the effectiveness of the improved algorithms and compare their performance with existing algorithms. In terms of cumulative reward, the proposed algorithms achieve an average improvement of 15% over on-policy algorithms and of 25% over common off-policy algorithms.
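The fast-timescale update with auxiliary variables can be sketched, in simplified form, with the classical GTD2 rule (Sutton et al.'s gradient TD family, on which OAKGTD2 is based). This is a minimal illustration assuming fixed tabular features rather than the learned attentive-kernel representation; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, rho,
              gamma=0.9, alpha=0.05, beta=0.5):
    """One off-policy GTD2 update with importance-sampling ratio rho.

    theta : value-function weights (the slow iterate)
    w     : auxiliary weights (the fast iterate, beta > alpha)
    """
    delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
    # Fast timescale: the auxiliary variable tracks the expected TD error
    # projected onto the feature space.
    w = w + beta * rho * (delta - w @ phi) * phi
    # Slow timescale: gradient-corrected update of the value weights.
    theta = theta + alpha * rho * (phi - gamma * phi_next) * (w @ phi)
    return theta, w

# Toy demo: a deterministic 3-state cycle 0 -> 1 -> 2 -> 0, reward 1 for
# leaving state 2, one-hot features, on-policy data (rho = 1).
features = np.eye(3)
rewards = [0.0, 0.0, 1.0]
theta, w = np.zeros(3), np.zeros(3)
s = 0
for _ in range(20000):
    s_next = (s + 1) % 3
    theta, w = gtd2_step(theta, w, features[s], features[s_next],
                         rewards[s], rho=1.0)
    s = s_next
# theta approaches the true discounted state values of the cycle.
```

The two step sizes encode the two-timescale structure: because beta is an order of magnitude larger than alpha, `w` equilibrates quickly relative to `theta`, which is the separation the convergence analysis relies on.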
Pages: 19
References (40 total)
[1] Barreto AMS, 2016, Journal of Machine Learning Research, V17
[2] Borkar VS, 2008, Stochastic Approximation: A Dynamical Systems Viewpoint, V9
[3] Borkar VS. Stochastic approximation with two time scales. Systems & Control Letters, 1997, 29(5): 291-294
[4] Borkar VS, Meyn SP. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 2000, 38(2): 447-469
[5] Chen X, Yang G, Yang S, Wang H, Dong S, Gao Y. Online attentive kernel-based temporal difference learning. Knowledge-Based Systems, 2023, 278
[6] Chen X, Gao Y, Wang R. Online selective kernel-based temporal difference learning. IEEE Transactions on Neural Networks and Learning Systems, 2013, 24(12): 1944-1956
[7] Chung W, 2019, Proceedings of the 7th International Conference on Learning Representations
[8] Girosi F, Jones M, Poggio T. Regularization theory and neural networks architectures. Neural Computation, 1995, 7(2): 219-269
[9] Haarnoja T, 2018, Proceedings of Machine Learning Research, V80
[10] Hirsch MW. Convergent activation dynamics in continuous-time networks. Neural Networks, 1989, 2(5): 331-349