Revisit last-iterate convergence of mSGD under milder requirement on step size

Times Cited: 0
Authors
Jin, Ruinan [1 ]
He, Xingkang [2 ]
Chen, Lang [3 ]
Cheng, Difei [4 ]
Gupta, Vijay [2 ,5 ]
Affiliations
[1] Chinese Univ Hong Kong, Shenzhen, Peoples R China
[2] Univ Notre Dame, Dept Elect Engn, Notre Dame, IN 46556 USA
[3] Shenyang Univ Chem Technol, Coll Informat Engn, Shenyang, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[5] Purdue Univ, Elmore Family Sch Elect & Comp Engn, W Lafayette, IN 47907 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022
Keywords
MOMENTUM;
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Understanding the convergence of stochastic gradient descent (SGD) based optimization algorithms helps in tackling large-scale machine learning problems. To ensure last-iterate convergence of SGD and momentum-based SGD (mSGD), existing studies usually constrain the step size $\epsilon_n$ to decay as $\sum_{n=1}^{+\infty}\epsilon_n^2 < +\infty$, which is rather conservative and may lead to slow convergence in the early stage of the iteration. In this paper, we relax this requirement by studying an alternative step size for mSGD. First, we relax the decay requirement on the step size to $\sum_{n=1}^{+\infty}\epsilon_n^{2+\eta_0} < +\infty$ ($0 \le \eta_0 < 1/2$). This implies that a larger step size, such as $\epsilon_n = 1/\sqrt{n}$, can be utilized to accelerate mSGD in the early stage. Under this new step size and some common conditions, we prove that the gradient norm of mSGD for a class of non-convex loss functions asymptotically decays to zero. In addition, we show that this step size can indeed help the iterates of mSGD converge into a neighborhood of the stationary points more quickly in the early stage. Finally, we establish the convergence of mSGD under a constant step size $\epsilon_n = \epsilon > 0$ by removing a common requirement in the literature on the strong convexity of the loss functions. Some experiments are given to illustrate the developed results.
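The following is a minimal sketch of a heavy-ball style mSGD update using the larger step size $\epsilon_n = 1/\sqrt{n}$ discussed in the abstract. The quadratic loss, noise model, and momentum coefficient are illustrative assumptions for this sketch, not the paper's actual algorithm parameters or experimental setup.

```python
# Illustrative mSGD sketch with step size eps_n = 1/sqrt(n).
# Loss f(x) = 0.5 * ||x||^2 with additive gradient noise is an assumption
# made here for demonstration only.
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    # Stochastic gradient of f(x) = 0.5 * ||x||^2 plus Gaussian noise.
    return x + 0.1 * rng.standard_normal(x.shape)

def msgd(x0, beta=0.9, n_iters=1000):
    x, m = x0.copy(), np.zeros_like(x0)
    for n in range(1, n_iters + 1):
        # eps_n = 1/sqrt(n) violates sum eps_n^2 < inf (harmonic series),
        # but satisfies sum eps_n^{2 + eta_0} < inf for any eta_0 > 0.
        eps_n = 1.0 / np.sqrt(n)
        m = beta * m + (1.0 - beta) * noisy_grad(x)  # momentum buffer (assumed form)
        x = x - eps_n * m                            # parameter update
    return x

x_final = msgd(np.array([5.0, -3.0]))
print(np.linalg.norm(x_final))  # iterate ends up near the stationary point at the origin
```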
Pages: 12