Neural Temporal-Difference Learning Converges to Global Optima
Cited by: 0
Authors:
Cai, Qi [1]
Yang, Zhuoran [2]
Lee, Jason D. [3]
Wang, Zhaoran [1]
Affiliations:
[1] Northwestern Univ, Dept Ind Engn & Management Sci, Evanston, IL 60208 USA
[2] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[3] Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA
Source:
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | 2019 / Vol. 32
Keywords:
ALGORITHMS;
DOI:
Not available
Chinese Library Classification:
TP18 [Artificial Intelligence Theory]
Discipline codes:
081104; 0812; 0835; 1405
Abstract:
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to non-convexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD.
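To make the object of study concrete, the following is a minimal sketch of semi-gradient TD(0) for policy evaluation with an overparametrized two-layer ReLU network. The two-state Markov reward process, the width m, the step size, and the convention of training only the input-layer weights while fixing the output layer are illustrative assumptions for this sketch, not the paper's exact setting or proof construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state Markov reward process (hypothetical, for illustration only).
phi = np.array([[1.0, 0.0], [0.0, 1.0]])  # state features phi(s)
P = np.array([[0.9, 0.1], [0.1, 0.9]])    # transition probabilities
r = np.array([1.0, -1.0])                 # expected rewards
gamma = 0.9                               # discount factor

# Overparametrized two-layer ReLU network:
#   V(s) = (1/sqrt(m)) * b . relu(W @ phi(s))
m = 512                               # width (the "overparametrization")
W = rng.normal(size=(m, 2))           # input weights, updated by TD
b = rng.choice([-1.0, 1.0], size=m)   # output weights, held fixed

def value(s, W):
    """Network value estimate for state s."""
    return b @ np.maximum(W @ phi[s], 0.0) / np.sqrt(m)

def grad_W(s, W):
    """Gradient of V(s) with respect to W (ReLU subgradient)."""
    active = (W @ phi[s] > 0).astype(float)
    return np.outer(b * active, phi[s]) / np.sqrt(m)

# Semi-gradient TD(0): the bootstrap target r + gamma*V(s') is treated
# as a constant, so only grad V(s) appears in the update.
eta = 0.05
s = 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[s])
    reward = r[s] + rng.normal(scale=0.1)   # noisy reward sample
    delta = reward + gamma * value(s_next, W) - value(s, W)
    W += eta * delta * grad_W(s, W)
    s = s_next
```

On this toy chain the learned values can be checked against the closed-form solution V = (I - gamma*P)^{-1} r; with a wide network the TD iterates stay close to their linearization around initialization, which is the regime the abstract's global-convergence claim concerns.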