Neural Temporal-Difference Learning Converges to Global Optima

Cited by: 0
Authors
Cai, Qi [1 ]
Yang, Zhuoran [2 ]
Lee, Jason D. [3 ]
Wang, Zhaoran [1 ]
Affiliations
[1] Northwestern Univ, Dept Ind Engn & Management Sci, Evanston, IL 60208 USA
[2] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[3] Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Keywords
ALGORITHMS;
DOI
None available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to non-convexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD.
Pages: 12
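The abstract above concerns TD learning with an overparameterized neural value function. For concreteness, below is a minimal sketch of neural TD(0) for policy evaluation with a wide two-layer ReLU network, roughly the regime the paper analyzes (fixed ±1 output weights, trained input weights). The width, step size, dynamics, and reward are illustrative assumptions, not the authors' code, and the paper's projection of the weights onto a ball around their initialization is omitted.

```python
import numpy as np

# Minimal sketch of neural TD(0) with an overparameterized two-layer
# ReLU network. All constants, the dynamics, and the reward below are
# illustrative assumptions; the projection step used in the paper's
# analysis is omitted for brevity.

rng = np.random.default_rng(0)

d, m = 4, 1024            # state dimension; network width (overparameterized)
gamma, alpha = 0.9, 0.01  # discount factor; TD step size

# V(s) = (1/sqrt(m)) * sum_r b_r * relu(w_r^T s), with b_r in {-1, +1}
# held fixed and only the input weights W trained.
W = rng.normal(size=(m, d))
b = rng.choice([-1.0, 1.0], size=m)

def value(s, W):
    return float(b @ np.maximum(W @ s, 0.0)) / np.sqrt(m)

def grad_value(s, W):
    # dV/dW_r = b_r * 1{w_r^T s > 0} * s / sqrt(m)
    active = (W @ s > 0.0).astype(float)
    return (b * active)[:, None] * s[None, :] / np.sqrt(m)

def td0_step(s, r, s_next, W):
    # Semi-gradient TD(0): move V(s) toward the bootstrapped target
    # r + gamma * V(s'), treating the target as a constant.
    delta = r + gamma * value(s_next, W) - value(s, W)
    return W + alpha * delta * grad_value(s, W)

# Toy usage: evaluate a fixed policy on a hypothetical transition stream.
s = rng.normal(size=d)
for _ in range(1000):
    s_next = 0.5 * s + 0.1 * rng.normal(size=d)  # hypothetical dynamics
    r = 0.01 * float(s @ s)                      # hypothetical reward
    W = td0_step(s, r, s_next, W)
    s = s_next
```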