Temporal-difference learning with nonlinear function approximation: lazy training and mean field regimes

Cited by: 0
Authors
Agazzi, Andrea [1 ]
Lu, Jianfeng [1 ,2 ,3 ]
Affiliations
[1] Duke Univ, Dept Math, Durham, NC 27708 USA
[2] Duke Univ, Dept Phys, Durham, NC 27708 USA
[3] Duke Univ, Dept Chem, Durham, NC 27708 USA
Source
MATHEMATICAL AND SCIENTIFIC MACHINE LEARNING, Vol. 145, 2021
Keywords
Reinforcement learning; neural networks; temporal-difference learning; mean-field; lazy training
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We discuss the approximation of the value function for infinite-horizon discounted Markov Reward Processes (MRPs) by wide neural networks trained with the Temporal-Difference (TD) learning algorithm. We first consider this problem under a certain scaling of the approximating function, leading to a regime called lazy training. In this regime, which arises naturally from the scaling implicit in the initialization of the neural network, the parameters of the model vary only slightly during the learning process, so the model behaves approximately linearly in its parameters. In the lazy training regime we prove exponential convergence of TD learning to local minimizers in the under-parametrized setting and to global minimizers in the over-parametrized setting. We then contrast this scaling with the alternative mean-field scaling, under which the approximately linear behavior of the model is lost. In this nonlinear, mean-field regime we prove that all fixed points of the dynamics in parameter space are global minimizers. Finally, we illustrate our convergence results with examples of models that diverge when trained with non-lazy TD learning.
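To make the two scalings concrete, below is a minimal sketch, not taken from the paper, of semi-gradient TD(0) for value-function approximation with a wide two-layer tanh network. A single factor alpha scales the approximating function: alpha = 1/sqrt(m) mimics the lazy (NTK-like) regime discussed above, while alpha = 1/m gives the mean-field scaling. The toy MRP, the learning rate, and all variable names here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the authors' code): semi-gradient TD(0) for an
# infinite-horizon discounted MRP, with the value function approximated by
# a wide two-layer network V(s) = alpha * sum_i a_i * tanh(w_i . s).
# alpha = 1/sqrt(m) corresponds to a lazy (NTK-like) scaling;
# alpha = 1/m corresponds to a mean-field scaling.

rng = np.random.default_rng(0)

m, d = 1024, 4            # network width, state dimension
gamma, lr = 0.9, 1e-2     # discount factor, learning rate
alpha = 1.0 / np.sqrt(m)  # lazy scaling; set to 1.0 / m for mean field

W = rng.normal(size=(m, d))  # hidden-layer weights
a = rng.normal(size=m)       # output-layer weights


def value(s):
    """Approximate value V(s; theta) = alpha * a . tanh(W s)."""
    return alpha * (a @ np.tanh(W @ s))


def td0_step(s, r, s_next):
    """One semi-gradient TD(0) update: theta += lr * delta * grad_theta V(s)."""
    global W, a
    delta = r + gamma * value(s_next) - value(s)       # TD error
    h = np.tanh(W @ s)
    grad_a = alpha * h                                 # dV/da
    grad_W = alpha * np.outer(a * (1.0 - h ** 2), s)   # dV/dW
    a = a + lr * delta * grad_a
    W = W + lr * delta * grad_W


# Toy linear-Gaussian MRP: contracting random walk, reward = first coordinate.
s = rng.normal(size=d)
for _ in range(2000):
    s_next = 0.95 * s + 0.1 * rng.normal(size=d)
    td0_step(s, float(s[0]), s_next)
    s = s_next

print("V at final state:", value(s))
```

Under the lazy scaling, each parameter update shrinks as the width m grows, so the network stays close to its linearization around the initialization, matching the "approximately linear behavior" described in the abstract; under the mean-field scaling that linearization no longer describes the dynamics, which is the nonlinear regime the paper analyzes.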
Pages: 37-74 (38 pages)