Revisit last-iterate convergence of mSGD under milder requirement on step size

Times Cited: 0
Authors
Jin, Ruinan [1 ]
He, Xingkang [2 ]
Chen, Lang [3 ]
Cheng, Difei [4 ]
Gupta, Vijay [2 ,5 ]
Affiliations
[1] Chinese Univ Hong Kong, Shenzhen, Peoples R China
[2] Univ Notre Dame, Dept Elect Engn, Notre Dame, IN 46556 USA
[3] Shenyang Univ Chem Technol, Coll Informat Engn, Shenyang, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[5] Purdue Univ, Elmore Family Sch Elect & Comp Engn, W Lafayette, IN 47907 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022
Keywords
MOMENTUM;
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Understanding the convergence of stochastic gradient descent (SGD) based optimization algorithms helps in tackling large-scale machine learning problems. To ensure last-iterate convergence of SGD and momentum-based SGD (mSGD), existing studies usually constrain the step size $\epsilon_n$ to decay as $\sum_{n=1}^{+\infty}\epsilon_n^2 < +\infty$, which is rather conservative and may lead to slow convergence in the early stage of the iteration. In this paper, we relax this requirement by studying an alternative step size for mSGD. First, we relax the decay requirement on the step size to $\sum_{n=1}^{+\infty}\epsilon_n^{2+\eta_0} < +\infty$ ($0 \le \eta_0 < 1/2$). This implies that a larger step size, such as $\epsilon_n = 1/\sqrt{n}$, can be utilized to accelerate mSGD in the early stage. Under this new step size and some common conditions, we prove that the gradient norm of mSGD for a class of non-convex loss functions asymptotically decays to zero. In addition, we show that this step size can indeed help the iterates of mSGD converge into a neighborhood of the stationary points more quickly in the early stage. Finally, we establish the convergence of mSGD under a constant step size $\epsilon_n = \epsilon > 0$ by removing a common requirement in the literature on the strong convexity of the loss functions. Some experiments are given to illustrate the developed results.
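The following is a minimal sketch of a heavy-ball style mSGD update using the larger step size $\epsilon_n = 1/\sqrt{n}$ discussed in the abstract. The quadratic loss, noise model, and momentum coefficient are illustrative assumptions for this sketch, not the paper's actual algorithm parameters or experimental setup.

```python
# Illustrative mSGD sketch with step size eps_n = 1/sqrt(n).
# Loss f(x) = 0.5 * ||x||^2 with additive gradient noise is an assumption
# made here for demonstration only.
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    # Stochastic gradient of f(x) = 0.5 * ||x||^2 plus Gaussian noise.
    return x + 0.1 * rng.standard_normal(x.shape)

def msgd(x0, beta=0.9, n_iters=1000):
    x, m = x0.copy(), np.zeros_like(x0)
    for n in range(1, n_iters + 1):
        # eps_n = 1/sqrt(n) violates sum eps_n^2 < inf (harmonic series),
        # but satisfies sum eps_n^{2 + eta_0} < inf for any eta_0 > 0.
        eps_n = 1.0 / np.sqrt(n)
        m = beta * m + (1.0 - beta) * noisy_grad(x)  # momentum buffer (assumed form)
        x = x - eps_n * m                            # parameter update
    return x

x_final = msgd(np.array([5.0, -3.0]))
print(np.linalg.norm(x_final))  # iterate ends up near the stationary point at the origin
```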
Pages: 12