Temporal-difference learning with nonlinear function approximation: lazy training and mean field regimes

Cited by: 0
Authors
Agazzi, Andrea [1 ]
Lu, Jianfeng [1 ,2 ,3 ]
Affiliations
[1] Duke Univ, Dept Math, Durham, NC 27708 USA
[2] Duke Univ, Dept Phys, Durham, NC 27708 USA
[3] Duke Univ, Dept Chem, Durham, NC 27708 USA
Source
MATHEMATICAL AND SCIENTIFIC MACHINE LEARNING, 2021, Vol. 145
Keywords
Reinforcement learning; neural networks; temporal-difference learning; mean-field; lazy training; REINFORCEMENT; GO;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We discuss the approximation of the value function for infinite-horizon discounted Markov Reward Processes (MRPs) with wide neural networks trained with the Temporal-Difference (TD) learning algorithm. We first consider this problem under a certain scaling of the approximating function, leading to a regime called lazy training. In this regime, which arises naturally from the scaling implicit in the initialization of the neural network, the parameters of the model vary only slightly during the learning process, so the model behaves approximately linearly. In the lazy training regime, we prove exponential convergence of the TD learning dynamics to local minimizers in the under-parametrized framework and to global minimizers in the over-parametrized one. We then compare the above scaling with the alternative mean-field scaling, where the approximately linear behavior of the model is lost. In this nonlinear, mean-field regime we prove that all fixed points of the dynamics in parameter space are global minimizers. Finally, we illustrate our convergence results with examples of models that diverge when trained with non-lazy TD learning.
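For concreteness, the snippet below is a minimal sketch of semi-gradient TD(0) value estimation with a wide two-layer network under the two parametrizations contrasted in the abstract (an NTK-style "lazy" scaling versus a mean-field scaling of the output layer). The network form, the tanh nonlinearity, the toy MRP, and all constants (width N, scale, learning rate, discount) are illustrative assumptions and not the paper's exact construction.

```python
import numpy as np

# Sketch (illustrative, not the paper's setup): TD(0) for a discounted MRP with
# a two-layer network V(s) = (scale / N) * sum_i a_i * tanh(w_i . s).
# scale = sqrt(N) gives a lazy/NTK-style parametrization; scale = 1 gives the
# mean-field parametrization (overall prefactor 1/N).

def make_network(d, N, scale, rng):
    return {"a": rng.standard_normal(N),
            "w": rng.standard_normal((N, d)),
            "scale": scale / N}

def value(net, s):
    return net["scale"] * net["a"] @ np.tanh(net["w"] @ s)

def td0_step(net, s, r, s_next, gamma=0.9, lr=1e-2):
    """One semi-gradient TD(0) update of the parameters (a, w)."""
    h = np.tanh(net["w"] @ s)                                # hidden activations
    delta = r + gamma * value(net, s_next) - value(net, s)   # TD error
    # Gradients of V(s) only (semi-gradient: the s_next term is not differentiated).
    grad_a = net["scale"] * h
    grad_w = net["scale"] * np.outer(net["a"] * (1.0 - h**2), s)
    net["a"] += lr * delta * grad_a
    net["w"] += lr * delta * grad_w
    return delta

# Toy usage on a linear-Gaussian MRP with 2-dimensional states (all choices hypothetical).
rng = np.random.default_rng(0)
net = make_network(d=2, N=512, scale=np.sqrt(512), rng=rng)  # lazy scaling; use scale=1 for mean-field
s = rng.standard_normal(2)
for _ in range(1000):
    s_next = 0.8 * s + 0.1 * rng.standard_normal(2)
    r = float(s @ s)                                          # illustrative reward
    td0_step(net, s, r, s_next)
    s = s_next
print("estimated V(s) =", value(net, s))
```

With scale = sqrt(N) the per-neuron updates are O(1/sqrt(N)), so for large N the parameters barely move and the model is effectively linearized around its initialization; with scale = 1 the parameters move by O(1) amounts and the dynamics stay genuinely nonlinear, which is the distinction the two regimes in the abstract formalize.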
Pages: 37-74 (38 pages)