Theoretical issues in deep networks

Cited by: 94
Authors
Poggio, Tomaso [1 ]
Banburski, Andrzej [1 ]
Liao, Qianli [1 ]
Affiliations
[1] MIT, Ctr Brains Minds & Machines, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Funding
U.S. National Science Foundation (NSF)
Keywords
machine learning; deep learning; approximation; optimization; generalization; NEURAL-NETWORKS; APPROXIMATION
DOI
10.1073/pnas.1907369117
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Science]
Discipline classification codes
07; 0710; 09
Abstract
While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about its approximation power, the dynamics of optimization, and good out-of-sample performance, despite overparameterization and the absence of explicit regularization. We review our recent results toward this goal. In approximation theory, both shallow and deep networks are known to approximate any continuous function, but at a cost that is exponential in the input dimension. However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid this curse of dimensionality. In characterizing minimization of the empirical exponential loss, we consider the gradient flow of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to normalized networks. The dynamics of the normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem. Thus there is implicit regularization in training deep networks under exponential-type loss functions during gradient flow. As a consequence, the critical points correspond to minimum norm infima of the loss. This result is especially relevant because it has recently been shown that, for overparameterized models, selection of a minimum norm solution optimizes cross-validation leave-one-out stability and thereby the expected error. Thus our results imply that gradient descent in deep networks minimizes the expected error.
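To make the weight-direction argument in the abstract concrete, here is a minimal sketch in LaTeX; the notation (per-layer weights W_k treated as flattened vectors, norm rho_k, direction V_k, loss L) is assumed for this illustration and is not a reproduction of the paper's full derivation.

\[
W_k = \rho_k V_k, \qquad \rho_k = \|W_k\|, \quad \|V_k\| = 1 .
\]
% Under the unconstrained gradient flow \dot{W}_k = -\nabla_{W_k} L, the chain rule gives
\[
\dot{\rho}_k = -\, V_k^{\top} \nabla_{W_k} L,
\qquad
\dot{V}_k = -\,\frac{1}{\rho_k}\,\bigl(I - V_k V_k^{\top}\bigr)\,\nabla_{W_k} L .
\]

The direction V_k thus follows the gradient projected onto the tangent space of the unit sphere, i.e., the same flow one obtains when minimizing L subject to the constraint \|V_k\| = 1. The unconstrained and constrained dynamics therefore share critical points, which is the sense in which gradient flow under exponential-type losses carries an implicit unit-norm constraint on the weight directions.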
Pages: 30039-30045
Page count: 7