Convergence Rates of Non-Convex Stochastic Gradient Descent Under a Generic Łojasiewicz Condition and Local Smoothness

Cited: 0
Authors
Scaman, Kevin [1 ,2 ]
Malherbe, Cedric [3 ]
Dos Santos, Ludovic [3 ]
Affiliations
[1] PSL Univ, Ecole Normale Super, DI ENS, CNRS, INRIA, Paris, France
[2] Huawei, Paris, France
[3] Huawei Noah's Ark Lab, Paris, France
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162 | 2022
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Training over-parameterized neural networks involves the empirical minimization of highly non-convex objective functions. Recently, a large body of work has provided theoretical evidence that, despite this non-convexity, properly initialized over-parameterized networks can converge to zero training loss, with analyses built on the Polyak-Łojasiewicz condition. However, these analyses are restricted to quadratic losses such as the mean squared error, and tend to indicate fast exponential convergence rates that are seldom observed in practice. In this work, we extend these results by analyzing stochastic gradient descent under more generic Łojasiewicz conditions that are applicable to any convex loss function, thereby covering a broader class of losses commonly used in practice, such as the cross-entropy. Moreover, our analysis provides high-probability bounds on the approximation error under sub-Gaussian gradient noise, and requires only local smoothness of the objective function, making it applicable to deep neural networks in realistic settings.
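For context on the abstract, the standard textbook forms of the conditions it invokes are sketched below; the constants μ > 0 and c > 0 and the exponent θ are assumptions of this sketch, as the paper's exact formulation is not reproduced in this record. For a function f with infimum f*, the Polyak-Łojasiewicz (PL) condition and the generic Łojasiewicz condition that generalizes it are commonly written as:

\[ \text{PL:}\quad \|\nabla f(x)\|^2 \;\ge\; 2\mu\,\big(f(x) - f^\star\big) \quad \text{for all } x, \]

\[ \text{Łojasiewicz:}\quad \|\nabla f(x)\| \;\ge\; c\,\big(f(x) - f^\star\big)^{\theta}, \qquad \theta \in [1/2, 1). \]

At θ = 1/2 the Łojasiewicz condition reduces to the PL condition, which for L-smooth objectives yields the linear (exponential) rate \( f(x_t) - f^\star \le (1 - \mu/L)^t\,(f(x_0) - f^\star) \) for gradient descent with step size 1/L; exponents θ > 1/2 instead yield sublinear (polynomial) rates, consistent with the slower-than-exponential convergence the abstract notes is observed in practice.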
Pages: 19310-19327
Number of pages: 18