Sample Complexity and Overparameterization Bounds for Temporal-Difference Learning With Neural Network Approximation

Cited by: 2
Authors
Cayci, Semih [1 ,2 ]
Satpathi, Siddhartha [3 ,4 ]
He, Niao [5 ]
Srikant, R. [1 ,6 ]
Affiliations
[1] Univ Illinois, Coordinated Sci Lab, Urbana, IL 61801 USA
[2] Rhein Westfal TH Aachen, Chair Math Informat Proc, D-52062 Aachen, Germany
[3] Univ Illinois, Urbana, IL 61801 USA
[4] Mayo Clin, Rochester, MN 55902 USA
[5] Swiss Fed Inst Technol, Dept Comp Sci, CH-8006 Zurich, Switzerland
[6] Univ Illinois, Dept Elect & Comp Engn, Urbana, IL 61801 USA
Funding
Swiss National Science Foundation; U.S. National Science Foundation;
Keywords
Neural networks; Approximation algorithms; Markov processes; Convergence; Complexity theory; Reinforcement learning; Kernel; reinforcement learning (RL); stochastic approximation; temporal-difference (TD) learning;
DOI
10.1109/TAC.2023.3234234
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
In this article, we study the dynamics of temporal-difference (TD) learning with neural network-based value function approximation over a general state space, namely, neural TD learning. We consider two practically used algorithms, projection-free and max-norm regularized neural TD learning, and establish the first convergence bounds for these algorithms. An interesting observation from our results is that max-norm regularization can dramatically improve the performance of TD learning algorithms in terms of sample complexity and overparameterization. The results in this work rely on a Lyapunov drift analysis of the network parameters as a stopped and controlled random process.
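The abstract contrasts projection-free and max-norm regularized neural TD learning. As a rough illustration only (not the authors' exact algorithm), the sketch below implements semi-gradient TD(0) on a two-layer ReLU network, with an optional per-neuron projection standing in for max-norm regularization; the class name `NeuralTD`, the projection radius `lam`, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class NeuralTD:
    """Sketch of neural TD(0) with f(s; W) = c . relu(W s) / sqrt(m)."""

    def __init__(self, dim, m, alpha=0.01, gamma=0.9, lam=1.0, max_norm=True):
        self.W0 = np.random.randn(m, dim)           # random initialization
        self.W = self.W0.copy()
        self.c = np.random.choice([-1.0, 1.0], m)   # fixed output weights
        self.m, self.alpha, self.gamma = m, alpha, gamma
        self.lam, self.max_norm = lam, max_norm

    def value(self, s):
        return self.c @ relu(self.W @ s) / np.sqrt(self.m)

    def step(self, s, r, s_next):
        # Semi-gradient TD(0) update on the hidden-layer weights.
        delta = r + self.gamma * self.value(s_next) - self.value(s)
        grad = (self.c * (self.W @ s > 0))[:, None] * s[None, :] / np.sqrt(self.m)
        self.W += self.alpha * delta * grad
        if self.max_norm:
            # Stand-in for max-norm regularization: project each neuron's
            # weight vector onto an l2-ball of radius lam around its
            # initialization (illustrative choice, not the paper's exact rule).
            diff = self.W - self.W0
            norms = np.linalg.norm(diff, axis=1, keepdims=True)
            self.W = self.W0 + diff * np.minimum(1.0, self.lam / np.maximum(norms, 1e-12))
```

Running the same transition loop with `max_norm=False` gives the projection-free variant; the abstract's claim is that the projected variant needs fewer samples and less overparameterization (a smaller width `m`) to converge.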
Pages: 2891-2905
Page count: 15