Neural Temporal-Difference Learning Converges to Global Optima
Cited by: 0
Authors:
Cai, Qi [1]
Yang, Zhuoran [2]
Lee, Jason D. [3]
Wang, Zhaoran [1]
Affiliations:
[1] Northwestern Univ, Dept Ind Engn & Management Sci, Evanston, IL 60208 USA
[2] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[3] Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA
Source:
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Keywords:
ALGORITHMS;
DOI:
Not available
Chinese Library Classification:
TP18 [Artificial Intelligence Theory]
Discipline codes:
081104; 0812; 0835; 1405
Abstract:
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to non-convexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD.
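To make the object of study concrete, the following is a minimal sketch of semi-gradient TD(0) for policy evaluation with an overparametrized two-layer ReLU network. The two-state Markov reward process, the width m, the step size, and the convention of training only the input-layer weights while fixing the output layer are illustrative assumptions for this sketch, not the paper's exact setting or proof construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state Markov reward process (hypothetical, for illustration only).
phi = np.array([[1.0, 0.0], [0.0, 1.0]])  # state features phi(s)
P = np.array([[0.9, 0.1], [0.1, 0.9]])    # transition probabilities
r = np.array([1.0, -1.0])                 # expected rewards
gamma = 0.9                               # discount factor

# Overparametrized two-layer ReLU network:
#   V(s) = (1/sqrt(m)) * b . relu(W @ phi(s))
m = 512                               # width (the "overparametrization")
W = rng.normal(size=(m, 2))           # input weights, updated by TD
b = rng.choice([-1.0, 1.0], size=m)   # output weights, held fixed

def value(s, W):
    """Network value estimate for state s."""
    return b @ np.maximum(W @ phi[s], 0.0) / np.sqrt(m)

def grad_W(s, W):
    """Gradient of V(s) with respect to W (ReLU subgradient)."""
    active = (W @ phi[s] > 0).astype(float)
    return np.outer(b * active, phi[s]) / np.sqrt(m)

# Semi-gradient TD(0): the bootstrap target r + gamma*V(s') is treated
# as a constant, so only grad V(s) appears in the update.
eta = 0.05
s = 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[s])
    reward = r[s] + rng.normal(scale=0.1)   # noisy reward sample
    delta = reward + gamma * value(s_next, W) - value(s, W)
    W += eta * delta * grad_W(s, W)
    s = s_next
```

On this toy chain the learned values can be checked against the closed-form solution V = (I - gamma*P)^{-1} r; with a wide network the TD iterates stay close to their linearization around initialization, which is the regime the abstract's global-convergence claim concerns.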