Temporal-difference learning with nonlinear function approximation: lazy training and mean field regimes

Cited by: 0
Authors
Agazzi, Andrea [1 ]
Lu, Jianfeng [1 ,2 ,3 ]
Affiliations
[1] Duke Univ, Dept Math, Durham, NC 27708 USA
[2] Duke Univ, Dept Phys, Durham, NC 27708 USA
[3] Duke Univ, Dept Chem, Durham, NC 27708 USA
Source
MATHEMATICAL AND SCIENTIFIC MACHINE LEARNING, 2021, Vol. 145
Keywords
Reinforcement learning; neural networks; temporal-difference learning; mean-field; lazy training; REINFORCEMENT; GO;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We discuss the approximation of the value function for infinite-horizon discounted Markov Reward Processes (MRPs) with wide neural networks trained with the Temporal-Difference (TD) learning algorithm. We first consider this problem under a certain scaling of the approximating function, leading to a regime called lazy training. In this regime, which arises naturally from the scaling implicit in the initialization of the neural network, the parameters of the model vary only slightly during the learning process, so the model behaves approximately linearly. In the lazy training regime, we prove exponential convergence of the TD learning dynamics to local minimizers in the under-parametrized framework and to global minimizers in the over-parametrized one. We then compare the above scaling with the alternative mean-field scaling, where the approximately linear behavior of the model is lost. In this nonlinear, mean-field regime we prove that all fixed points of the dynamics in parameter space are global minimizers. Finally, we illustrate our convergence results with examples of models that diverge when trained with non-lazy TD learning.
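For concreteness, the snippet below is a minimal sketch of semi-gradient TD(0) value estimation with a wide two-layer network under the two parametrizations contrasted in the abstract (an NTK-style "lazy" scaling versus a mean-field scaling of the output layer). The network form, the tanh nonlinearity, the toy MRP, and all constants (width N, scale, learning rate, discount) are illustrative assumptions and not the paper's exact construction.

```python
import numpy as np

# Sketch (illustrative, not the paper's setup): TD(0) for a discounted MRP with
# a two-layer network V(s) = (scale / N) * sum_i a_i * tanh(w_i . s).
# scale = sqrt(N) gives a lazy/NTK-style parametrization; scale = 1 gives the
# mean-field parametrization (overall prefactor 1/N).

def make_network(d, N, scale, rng):
    return {"a": rng.standard_normal(N),
            "w": rng.standard_normal((N, d)),
            "scale": scale / N}

def value(net, s):
    return net["scale"] * net["a"] @ np.tanh(net["w"] @ s)

def td0_step(net, s, r, s_next, gamma=0.9, lr=1e-2):
    """One semi-gradient TD(0) update of the parameters (a, w)."""
    h = np.tanh(net["w"] @ s)                                # hidden activations
    delta = r + gamma * value(net, s_next) - value(net, s)   # TD error
    # Gradients of V(s) only (semi-gradient: the s_next term is not differentiated).
    grad_a = net["scale"] * h
    grad_w = net["scale"] * np.outer(net["a"] * (1.0 - h**2), s)
    net["a"] += lr * delta * grad_a
    net["w"] += lr * delta * grad_w
    return delta

# Toy usage on a linear-Gaussian MRP with 2-dimensional states (all choices hypothetical).
rng = np.random.default_rng(0)
net = make_network(d=2, N=512, scale=np.sqrt(512), rng=rng)  # lazy scaling; use scale=1 for mean-field
s = rng.standard_normal(2)
for _ in range(1000):
    s_next = 0.8 * s + 0.1 * rng.standard_normal(2)
    r = float(s @ s)                                          # illustrative reward
    td0_step(net, s, r, s_next)
    s = s_next
print("estimated V(s) =", value(net, s))
```

With scale = sqrt(N) the per-neuron updates are O(1/sqrt(N)), so for large N the parameters barely move and the model is effectively linearized around its initialization; with scale = 1 the parameters move by O(1) amounts and the dynamics stay genuinely nonlinear, which is the distinction the two regimes in the abstract formalize.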
Pages: 37-74 (38 pages)