AdaSAM: Boosting sharpness-aware minimization with adaptive learning rate and momentum for neural networks

Cited by: 7

Authors:

Sun, Hao [1]

Shen, Li [2]

Zhong, Qihuang [3]

Ding, Liang [2]

Chen, Shixiang [4]

Sun, Jingwei [1]

Li, Jing [1]

Sun, Guangzhong [1]

Tao, Dacheng [5]

Affiliations:

[1] Univ Sci & Technol China, Sch Comp Sci, Hefei 230026, Anhui, Peoples R China

[2] JD.com, Beijing, Peoples R China

[3] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Hubei, Peoples R China

[4] Univ Sci & Technol China, Sch Math Sci, Hefei 230026, Anhui, Peoples R China

[5] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia

Source:

NEURAL NETWORKS | 2024 / Vol. 169

Keywords:

Sharpness-aware minimization; Adaptive learning rate; Non-convex optimization; Momentum acceleration; Linear speedup; CONVERGENCE;

DOI:

10.1016/j.neunet.2023.10.044

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The sharpness-aware minimization (SAM) optimizer has been extensively explored because it generalizes better when training deep neural networks: it introduces extra perturbation steps that flatten the loss landscape of deep learning models. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without theoretical guarantees, owing to the triple difficulty of analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits an O(1/√(bT)) convergence rate, which achieves a linear speedup with respect to the mini-batch size b. Specifically, to decouple the stochastic gradient steps from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term that makes them independent when taking an expectation during the analysis. We then bound them by showing that the adaptive learning rate has a limited range, which makes our analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks and a synthetic task, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
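
The combination described in the abstract (a SAM perturbation step followed by a momentum and adaptive-learning-rate update) can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm as published: the AMSGrad-style running maximum, the function name adasam_step, the toy quadratic objective, and all hyperparameter values (lr, rho, beta1, beta2, eps) are assumptions chosen for illustration, and the delayed second-order momentum term used in the paper's analysis is not modeled here.

```python
import numpy as np

# Toy objective (an assumption for illustration): f(w) = 0.5 * w^T A w
A = np.diag([1.0, 10.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def adasam_step(w, grad_fn, state, lr=0.05, rho=0.05,
                beta1=0.9, beta2=0.99, eps=1e-8):
    """One illustrative SAM + momentum + adaptive-learning-rate update."""
    g = grad_fn(w)
    # SAM perturbation: move a distance rho along the normalized gradient
    delta = rho * g / (np.linalg.norm(g) + 1e-12)
    # Gradient evaluated at the perturbed point
    g_sam = grad_fn(w + delta)
    # First-moment (momentum) estimate of the perturbed gradient
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * g_sam
    # Second-moment estimate, with an AMSGrad-style running maximum (assumed)
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * g_sam ** 2
    state["v_hat"] = np.maximum(state["v_hat"], state["v"])
    # Per-coordinate adaptive step
    return w - lr * state["m"] / (np.sqrt(state["v_hat"]) + eps)

w = np.array([3.0, -2.0])
state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "v_hat": np.zeros_like(w)}
initial_loss = loss(w)
for _ in range(500):
    w = adasam_step(w, grad, state)
final_loss = loss(w)
print(f"loss: {initial_loss:.3f} -> {final_loss:.6f}")
```

Because the perturbed gradient does not vanish exactly at the minimizer, the iterates in this sketch settle in a small neighborhood of the minimum rather than converging to it exactly; in the stochastic setting of the abstract, grad_fn would return a mini-batch gradient of size b.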

Pages: 506-519

Number of pages: 14
