Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers

Cited by: 27
Authors
Bah, Bubacarr [1]
Rauhut, Holger [2]
Terstiege, Ulrich [2]
Westdickenberg, Michael [3]
Affiliations
[1] Stellenbosch Univ, Res Ctr, African Inst Math Sci AIMS South Africa, Dept Math Sci, Cape Town, Western Cape, South Africa
[2] Rhein Westfal TH Aachen, Chair Math Informat Proc, Pontdriesch 10, D-52062 Aachen, Germany
[3] Rhein Westfal TH Aachen, Inst Math, Templergraben 55, D-52062 Aachen, Germany
Keywords
Riemannian; gradient flow; manifolds; deep learning; neural networks
DOI
10.1093/imaiai/iaaa039
Chinese Library Classification (CLC) number
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
We study the convergence of gradient flows related to learning deep linear neural networks (where the activation function is the identity map) from data. In this case, the composition of the network layers amounts to simply multiplying the weight matrices of all layers together, resulting in an overparameterized problem. The gradient flow with respect to these factors can be re-interpreted as a Riemannian gradient flow on the manifold of rank-r matrices endowed with a suitable Riemannian metric. We show that the flow always converges to a critical point of the underlying functional. Moreover, we establish that, for almost all initializations, the flow converges to a global minimum on the manifold of rank-k matrices for some k ≤ r.
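As a small illustration of the abstract's first observation, the sketch below checks that passing an input through identity-activation layers one at a time gives the same result as applying the single product matrix W3 W2 W1. The layer sizes, the integer weights, and the helper names (`matmul`, `forward`, `end_to_end`) are assumptions made up for this example and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def forward(weights, x):
    """Apply the layers in order; with identity activation each layer is just W @ h."""
    h = x
    for w in weights:
        h = matmul(w, h)
    return h


def end_to_end(weights):
    """Collapse the whole network into the single matrix W_N ... W_1."""
    product = weights[0]
    for w in weights[1:]:
        product = matmul(w, product)
    return product


# Hypothetical network with widths 3 -> 2 -> 2 -> 3 (integer weights keep arithmetic exact).
W1 = [[1, 0, 2], [0, 1, -1]]    # 2 x 3
W2 = [[2, 1], [1, 1]]           # 2 x 2
W3 = [[1, 0], [0, 1], [1, 1]]   # 3 x 2
weights = [W1, W2, W3]
x = [[1], [2], [3]]             # one input column vector

layerwise = forward(weights, x)
collapsed = matmul(end_to_end(weights), x)
print(layerwise == collapsed)   # the two computations agree
```

In this toy example the three factors carry 16 parameters while the 3 x 3 product has only 9 entries, which is one way to read "overparameterized". Because the product factors through a hidden layer of width 2, its rank is at most 2, which is why a set of matrices of bounded rank is the natural object to consider here.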
Pages: 307-353
Page count: 47