On the Last Iterate Convergence of Momentum Methods

Cited by: 0
Authors
Li, Xiaoyu [1 ]
Liu, Mingrui [2 ]
Orabona, Francesco [3 ]
Affiliations
[1] Boston Univ, Div Syst Engn, Boston, MA 02215 USA
[2] George Mason Univ, Dept Comp Sci, Fairfax, VA 22030 USA
[3] Boston Univ, Elect & Comp Engn, Boston, MA 02215 USA
Source
INTERNATIONAL CONFERENCE ON ALGORITHMIC LEARNING THEORY, VOL 167, 2022
Funding
US National Science Foundation (NSF)
Keywords
Convex optimization; Momentum methods; Stochastic optimization
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of Ω(ln T / √T) after T iterations. Based on this fact, we study a class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with increasing momentum and shrinking updates. For these algorithms, we show that the last iterate has the optimal convergence rate O(1/√T) for unconstrained convex stochastic optimization problems, without projections onto bounded domains or knowledge of T. Further, we show a variety of results for FTRL-based SGDM when used with adaptive stepsizes. Empirical results are shown as well.
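For orientation, the sketch below shows the standard heavy-ball-style SGDM update with a constant momentum factor, constant step size, no projection, and no iterate averaging, i.e., the baseline setting to which the Ω(ln T / √T) last-iterate lower bound applies. The function names, step-size and momentum values, and the toy objective are illustrative assumptions only; the paper's FTRL-based variants with increasing momentum and shrinking updates are not reproduced here.

```python
import numpy as np

# Minimal sketch of standard SGD with Momentum (SGDM), heavy-ball style.
# This is the common baseline with a constant momentum factor beta and a
# constant step size eta; the paper's FTRL-based algorithms with increasing
# momentum and shrinking updates use different schedules and are NOT shown.

def sgdm(grad_fn, x0, eta=0.01, beta=0.9, T=1000, rng=None):
    """Run T steps of SGDM on a stochastic (sub)gradient oracle grad_fn.

    grad_fn(x, rng) should return an unbiased estimate of a subgradient at x.
    Returns the last iterate, whose convergence rate is the focus of the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(T):
        g = grad_fn(x, rng)
        m = beta * m + g          # momentum buffer with constant factor beta
        x = x - eta * m           # constant step size; no projection step
    return x                      # last iterate (no averaging)

# Toy example (hypothetical): a Lipschitz, non-smooth convex objective
# f(x) = |x| with noisy subgradients.
if __name__ == "__main__":
    noisy_subgrad = lambda x, rng: np.sign(x) + 0.1 * rng.standard_normal(x.shape)
    x_T = sgdm(noisy_subgrad, x0=np.array([5.0]), T=5000)
    print("last iterate:", x_T)
```

Returning only the final iterate, with no averaging and no projection onto a bounded domain, mirrors the unconstrained last-iterate setting the abstract describes; the toy |x| objective merely makes the behavior easy to inspect.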
Pages: 19