Sample Complexity and Overparameterization Bounds for Temporal-Difference Learning With Neural Network Approximation

Cited by: 2
Authors
Cayci, Semih [1 ,2 ]
Satpathi, Siddhartha [3 ,4 ]
He, Niao [5 ]
Srikant, R. [1 ,6 ]
Affiliations
[1] Univ Illinois, Coordinated Sci Lab, Urbana, IL 61801 USA
[2] Rhein Westfal TH Aachen, Chair Math Informat Proc, D-52062 Aachen, Germany
[3] Univ Illinois, Urbana, IL 61801 USA
[4] Mayo Clin, Rochester, MN 55902 USA
[5] Swiss Fed Inst Technol, Dept Comp Sci, CH-8006 Zurich, Switzerland
[6] Univ Illinois, Dept Elect & Comp Engn, Urbana, IL 61801 USA
Funding
Swiss National Science Foundation; U.S. National Science Foundation;
Keywords
Neural networks; Approximation algorithms; Markov processes; Convergence; Complexity theory; Reinforcement learning; Kernel; reinforcement learning (RL); stochastic approximation; temporal-difference (TD) learning;
DOI
10.1109/TAC.2023.3234234
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
In this article, we study the dynamics of temporal-difference (TD) learning with neural network-based value function approximation over a general state space, namely, neural TD learning. We consider two practically used algorithms, projection-free and max-norm regularized neural TD learning, and establish the first convergence bounds for these algorithms. An interesting observation from our results is that max-norm regularization can dramatically improve the performance of TD learning algorithms in terms of sample complexity and overparameterization. The results in this work rely on a Lyapunov drift analysis of the network parameters as a stopped and controlled random process.
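The abstract contrasts projection-free and max-norm regularized neural TD learning. As a rough illustration only (not the authors' exact algorithm), the sketch below implements semi-gradient TD(0) on a two-layer ReLU network, with an optional per-neuron projection standing in for max-norm regularization; the class name `NeuralTD`, the projection radius `lam`, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class NeuralTD:
    """Sketch of neural TD(0) with f(s; W) = c . relu(W s) / sqrt(m)."""

    def __init__(self, dim, m, alpha=0.01, gamma=0.9, lam=1.0, max_norm=True):
        self.W0 = np.random.randn(m, dim)           # random initialization
        self.W = self.W0.copy()
        self.c = np.random.choice([-1.0, 1.0], m)   # fixed output weights
        self.m, self.alpha, self.gamma = m, alpha, gamma
        self.lam, self.max_norm = lam, max_norm

    def value(self, s):
        return self.c @ relu(self.W @ s) / np.sqrt(self.m)

    def step(self, s, r, s_next):
        # Semi-gradient TD(0) update on the hidden-layer weights.
        delta = r + self.gamma * self.value(s_next) - self.value(s)
        grad = (self.c * (self.W @ s > 0))[:, None] * s[None, :] / np.sqrt(self.m)
        self.W += self.alpha * delta * grad
        if self.max_norm:
            # Stand-in for max-norm regularization: project each neuron's
            # weight vector onto an l2-ball of radius lam around its
            # initialization (illustrative choice, not the paper's exact rule).
            diff = self.W - self.W0
            norms = np.linalg.norm(diff, axis=1, keepdims=True)
            self.W = self.W0 + diff * np.minimum(1.0, self.lam / np.maximum(norms, 1e-12))
```

Running the same transition loop with `max_norm=False` gives the projection-free variant; the abstract's claim is that the projected variant needs fewer samples and less overparameterization (a smaller width `m`) to converge.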
Pages: 2891-2905
Page count: 15