Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution

Citations: 78
Authors
Ma, Cong [1 ]
Wang, Kaizheng [1 ]
Chi, Yuejie [2]
Chen, Yuxin [3 ]
Affiliations
[1] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[2] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
[3] Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA
Keywords
Nonconvex optimization; Gradient descent; Leave-one-out analysis; Phase retrieval; Matrix completion; Blind deconvolution; PERTURBATION BOUNDS; STRUCTURED SIGNAL; OPTIMAL RATES; RECOVERY; ALGORITHM; NORM; OPTIMIZATION; INCOHERENCE; CONVEXITY; ROTATION;
DOI
10.1007/s10208-019-09429-9
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
Recent years have seen a flurry of activity in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g., trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting or lacks performance guarantees altogether. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory that stays within a basin enjoying nice geometry, consisting of points incoherent with the sampling mechanism. This "implicit regularization" feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, namely phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a by-product, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control, measured both entrywise and in the spectral norm, which might be of independent interest.
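To make the setting concrete, below is a minimal sketch (not the authors' code) of vanilla gradient descent for real-valued phase retrieval with spectral initialization, run with no trimming, projection, or other explicit regularization. The problem sizes, step size, and iteration count are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

# Minimal sketch of vanilla gradient descent for real-valued phase retrieval
# (Wirtinger-flow-style), illustrating the "no explicit regularization" setup
# described in the abstract. All numerical choices below are illustrative.

rng = np.random.default_rng(0)
n, m = 100, 1000                      # signal dimension, number of measurements
x_star = rng.standard_normal(n)       # ground-truth signal x*
A = rng.standard_normal((m, n))       # Gaussian sampling vectors a_j (rows of A)
y = (A @ x_star) ** 2                 # phaseless measurements y_j = (a_j^T x*)^2

# Spectral initialization: leading eigenvector of (1/m) * sum_j y_j a_j a_j^T,
# rescaled so its norm matches the estimated signal energy.
Y = (A.T * y) @ A / m
eigvals, eigvecs = np.linalg.eigh(Y)
x = eigvecs[:, -1] * np.sqrt(y.mean())

# Vanilla gradient descent on f(x) = (1/4m) * sum_j ((a_j^T x)^2 - y_j)^2,
# with a constant step size and no trimming or projection.
eta = 0.1 / y.mean()                  # y.mean() estimates ||x*||^2 for Gaussian designs
for _ in range(500):
    Ax = A @ x
    grad = A.T @ ((Ax ** 2 - y) * Ax) / m
    x = x - eta * grad

# Report the error up to the unavoidable global sign ambiguity.
dist = min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star))
print(f"relative error: {dist / np.linalg.norm(x_star):.2e}")
```

Note that the step size here is constant and fairly aggressive rather than the highly conservative rates suggested by worst-case analyses, which is the regime the paper's implicit-regularization guarantees are about.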
Pages: 451-632
Page count: 182