Neural Temporal-Difference Learning Converges to Global Optima

Cited by: 0
Authors
Cai, Qi [1]
Yang, Zhuoran [2]
Lee, Jason D. [3]
Wang, Zhaoran [1]
Affiliations
[1] Northwestern Univ, Dept Ind Engn & Management Sci, Evanston, IL 60208 USA
[2] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[3] Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Keywords
ALGORITHMS;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to non-convexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD.
Pages: 12