Stochastic momentum methods for non-convex learning without bounded assumptions

Cited by: 10
Authors
Liang, Yuqing [1]
Liu, Jinlan [1]
Xu, Dongpo [1]
Affiliations
[1] Northeast Normal University, School of Mathematics & Statistics, Key Laboratory of Applied Statistics (MOE), Changchun 130024, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Non-convex optimization; Last-iterate convergence rate; Stochastic momentum methods; PL condition; Machine learning; Gradient descent; Convergence
DOI
10.1016/j.neunet.2023.06.021
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Stochastic momentum methods are widely used to solve stochastic optimization problems in machine learning, yet most existing theoretical analyses rely on either bounded assumptions or strong stepsize conditions. In this paper, we focus on a class of non-convex objective functions satisfying the Polyak-Łojasiewicz (PL) condition and present a unified convergence rate analysis for stochastic momentum methods, covering stochastic heavy ball (SHB) and stochastic Nesterov accelerated gradient (SNAG), without any bounded assumptions. Our analysis achieves the more challenging last-iterate convergence rate of function values under the relaxed growth (RG) condition, a weaker assumption than those used in related work. Specifically, we attain a sub-linear rate for stochastic momentum methods with diminishing stepsizes and a linear convergence rate for constant stepsizes when the strong growth (SG) condition holds. We also examine the iteration complexity for obtaining an ε-accurate solution at the last iterate. Moreover, we provide a more flexible stepsize scheme for stochastic momentum methods in three respects: (i) relaxing the stepsize requirement for last-iterate convergence from square summability to merely vanishing in the limit; (ii) extending the minimum-iterate convergence rate stepsize to the non-monotonic case; (iii) expanding the last-iterate convergence rate stepsize to a more general form. Finally, we conduct numerical experiments on benchmark datasets to validate our theoretical findings. © 2023 Elsevier Ltd. All rights reserved.
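For orientation, a minimal sketch of the quantities named in the abstract, using standard textbook formulations that may differ in notation and constants from the paper itself: the PL condition, a growth condition on the stochastic gradient, and the SHB update with stepsize $\alpha_t$ and momentum $\beta_t$ can be written as

\[
\tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr), \qquad
\mathbb{E}_{\xi}\bigl[\|\nabla f(x;\xi)\|^2\bigr] \;\le\; \rho\,\|\nabla f(x)\|^2 + \sigma^2,
\]
\[
x_{t+1} \;=\; x_t \;-\; \alpha_t \nabla f(x_t;\xi_t) \;+\; \beta_t\,(x_t - x_{t-1}),
\]

where $f^*$ denotes the minimum value and $\mu>0$ the PL constant. In this standard form, setting $\sigma^2 = 0$ in the growth inequality recovers the strong growth (SG) condition, while allowing $\sigma^2 > 0$ gives the relaxed growth (RG) condition referred to above.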
Pages: 830-845
Page count: 16