AdaSAM: Boosting sharpness-aware minimization with adaptive learning rate and momentum for neural networks

Cited: 7
Authors
Sun, Hao [1]
Shen, Li [2]
Zhong, Qihuang [3]
Ding, Liang [2]
Chen, Shixiang [4]
Sun, Jingwei [1]
Li, Jing [1]
Sun, Guangzhong [1]
Tao, Dacheng [5]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci, Hefei 230026, Anhui, Peoples R China
[2] JD.com, Beijing, Peoples R China
[3] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Hubei, Peoples R China
[4] Univ Sci & Technol China, Sch Math Sci, Hefei 230026, Anhui, Peoples R China
[5] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
Keywords
Sharpness-aware minimization; Adaptive learning rate; Non-convex optimization; Momentum acceleration; Linear speedup; Convergence
DOI
10.1016/j.neunet.2023.10.044
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The sharpness-aware minimization (SAM) optimizer has been extensively explored because it generalizes better when training deep neural networks, introducing an extra perturbation step that flattens the loss landscape of deep learning models. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically to train large-scale deep neural networks, but without theoretical guarantees, owing to the triple difficulty of analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits an O(1/√(bT)) convergence rate, which achieves the linear speedup property with respect to the mini-batch size b. Specifically, to decouple the stochastic gradient steps from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term to decompose them so that they become independent when taking expectations during the analysis. We then bound them by showing that the adaptive learning rate has a limited range, which makes our analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks and a synthetic task, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
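The abstract describes AdaSAM as a SAM perturbation step combined with an Adam-style adaptive learning rate and momentum. Below is a minimal PyTorch sketch of that combination, for illustration only; the class name, hyperparameter defaults (lr, rho, betas, eps), and the two-call interface (ascent_step / descent_step) are assumptions of this sketch, not the paper's reference implementation.

import torch

class AdaSAMSketch(torch.optim.Optimizer):
    # Illustrative sketch: SAM-style weight perturbation followed by an
    # Adam-style update (momentum + adaptive learning rate). The defaults
    # below are assumptions for this sketch, not the paper's settings.
    def __init__(self, params, lr=1e-3, rho=0.05, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, rho=rho, betas=betas, eps=eps))

    @torch.no_grad()
    def ascent_step(self):
        # Move weights to w + rho * g / ||g|| (the SAM perturbation).
        for group in self.param_groups:
            grads = [p.grad for p in group["params"] if p.grad is not None]
            if not grads:
                continue
            grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
            scale = group["rho"] / (grad_norm + 1e-12)
            for p in group["params"]:
                if p.grad is None:
                    continue
                e_w = p.grad * scale
                p.add_(e_w)
                self.state[p]["e_w"] = e_w  # remember the perturbation

    @torch.no_grad()
    def descent_step(self):
        # Restore the original weights, then apply the Adam-style update
        # using the gradient evaluated at the perturbed point.
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                p.sub_(state.pop("e_w", torch.zeros_like(p)))
                m = state.setdefault("m", torch.zeros_like(p))  # first moment
                v = state.setdefault("v", torch.zeros_like(p))  # second moment
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                p.addcdiv_(m, v.sqrt().add_(group["eps"]), value=-group["lr"])

In use, each training step would compute the loss and call backward(), then ascent_step(), zero the gradients, recompute the loss and call backward() again at the perturbed weights, and finally call descent_step(); this mirrors the two forward-backward passes that SAM-style methods require per iteration.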
Pages: 506-519
Page count: 14