Temporal-difference learning with nonlinear function approximation: lazy training and mean field regimes

Cited by: 0
Authors
Agazzi, Andrea [1 ]
Lu, Jianfeng [1 ,2 ,3 ]
Affiliations
[1] Duke Univ, Dept Math, Durham, NC 27708 USA
[2] Duke Univ, Dept Phys, Durham, NC 27708 USA
[3] Duke Univ, Dept Chem, Durham, NC 27708 USA
Source
MATHEMATICAL AND SCIENTIFIC MACHINE LEARNING, Vol. 145, 2021
Keywords
Reinforcement learning; neural networks; temporal-difference learning; mean-field; lazy training
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We discuss the approximation of the value function for infinite-horizon discounted Markov Reward Processes (MRPs) by wide neural networks trained with the Temporal-Difference (TD) learning algorithm. We first consider this problem under a certain scaling of the approximating function, leading to a regime called lazy training. In this regime, which arises naturally from the scaling implicit in the initialization of the neural network, the parameters of the model vary only slightly during the learning process, so the model behaves approximately linearly in its parameters. In the lazy training regime we prove exponential convergence of TD learning to local minimizers in the under-parametrized setting and to global minimizers in the over-parametrized setting. We then contrast this scaling with the alternative mean-field scaling, under which the approximately linear behavior of the model is lost. In this nonlinear, mean-field regime we prove that all fixed points of the dynamics in parameter space are global minimizers. Finally, we illustrate our convergence results with examples of models that diverge when trained with non-lazy TD learning.
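To make the two scalings concrete, below is a minimal sketch, not taken from the paper, of semi-gradient TD(0) for value-function approximation with a wide two-layer tanh network. A single factor alpha scales the approximating function: alpha = 1/sqrt(m) mimics the lazy (NTK-like) regime discussed above, while alpha = 1/m gives the mean-field scaling. The toy MRP, the learning rate, and all variable names here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the authors' code): semi-gradient TD(0) for an
# infinite-horizon discounted MRP, with the value function approximated by
# a wide two-layer network V(s) = alpha * sum_i a_i * tanh(w_i . s).
# alpha = 1/sqrt(m) corresponds to a lazy (NTK-like) scaling;
# alpha = 1/m corresponds to a mean-field scaling.

rng = np.random.default_rng(0)

m, d = 1024, 4            # network width, state dimension
gamma, lr = 0.9, 1e-2     # discount factor, learning rate
alpha = 1.0 / np.sqrt(m)  # lazy scaling; set to 1.0 / m for mean field

W = rng.normal(size=(m, d))  # hidden-layer weights
a = rng.normal(size=m)       # output-layer weights


def value(s):
    """Approximate value V(s; theta) = alpha * a . tanh(W s)."""
    return alpha * (a @ np.tanh(W @ s))


def td0_step(s, r, s_next):
    """One semi-gradient TD(0) update: theta += lr * delta * grad_theta V(s)."""
    global W, a
    delta = r + gamma * value(s_next) - value(s)       # TD error
    h = np.tanh(W @ s)
    grad_a = alpha * h                                 # dV/da
    grad_W = alpha * np.outer(a * (1.0 - h ** 2), s)   # dV/dW
    a = a + lr * delta * grad_a
    W = W + lr * delta * grad_W


# Toy linear-Gaussian MRP: contracting random walk, reward = first coordinate.
s = rng.normal(size=d)
for _ in range(2000):
    s_next = 0.95 * s + 0.1 * rng.normal(size=d)
    td0_step(s, float(s[0]), s_next)
    s = s_next

print("V at final state:", value(s))
```

Under the lazy scaling, each parameter update shrinks as the width m grows, so the network stays close to its linearization around the initialization, matching the "approximately linear behavior" described in the abstract; under the mean-field scaling that linearization no longer describes the dynamics, which is the nonlinear regime the paper analyzes.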
Pages: 37-74 (38 pages)