AdaSAM: Boosting sharpness-aware minimization with adaptive learning rate and momentum for neural networks

Cited by: 7
Authors
Sun, Hao [1]
Shen, Li [2]
Zhong, Qihuang [3]
Ding, Liang [2]
Chen, Shixiang [4]
Sun, Jingwei [1]
Li, Jing [1]
Sun, Guangzhong [1]
Tao, Dacheng [5]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci, Hefei 230026, Anhui, Peoples R China
[2] JD.com, Beijing, Peoples R China
[3] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Hubei, Peoples R China
[4] Univ Sci & Technol China, Sch Math Sci, Hefei 230026, Anhui, Peoples R China
[5] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
Keywords
Sharpness-aware minimization; Adaptive learning rate; Non-convex optimization; Momentum acceleration; Linear speedup; CONVERGENCE;
DOI
10.1016/j.neunet.2023.10.044
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The sharpness-aware minimization (SAM) optimizer has been extensively explored because it improves generalization when training deep neural networks by introducing an extra perturbation step that flattens the loss landscape of deep learning models. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without theoretical guarantees, owing to the triple difficulty of analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits an O(1/√(bT)) convergence rate, which achieves the linear speedup property with respect to the mini-batch size b. Specifically, to decouple the stochastic gradient step from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term that makes them independent when taking expectations in the analysis. We then bound them by showing that the adaptive learning rate has a limited range, which makes our analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks and a synthetic task, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
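To make the update described in the abstract concrete, below is a minimal sketch of one AdaSAM-style iteration in NumPy: a SAM perturbation step followed by an Adam-style adaptive learning rate with momentum. The function and variable names (adasam_step, grad_fn, state), the hyperparameter defaults, and the toy quadratic objective are illustrative assumptions, not the paper's exact algorithm or notation.

```python
import numpy as np

def adasam_step(params, grad_fn, state, lr=1e-3, rho=0.05,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative AdaSAM-style iteration (a sketch, not the paper's exact algorithm).

    grad_fn(params) should return a stochastic (mini-batch) gradient at `params`.
    `state` carries the first-moment (momentum) and second-moment estimates.
    """
    g = grad_fn(params)                                   # stochastic gradient at the current point
    perturb = rho * g / (np.linalg.norm(g) + 1e-12)       # SAM ascent (perturbation) step
    g_sam = grad_fn(params + perturb)                     # gradient at the perturbed point

    state["m"] = beta1 * state["m"] + (1 - beta1) * g_sam         # momentum (first moment)
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_sam ** 2    # second moment (adaptive scaling)

    new_params = params - lr * state["m"] / (np.sqrt(state["v"]) + eps)
    return new_params, state

# Toy usage: minimize a noisy quadratic f(x) = 0.5 * ||x||^2 with gradient noise standing in for mini-batch sampling.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
state = {"m": np.zeros_like(x), "v": np.zeros_like(x)}
noisy_grad = lambda p: p + 0.1 * rng.normal(size=p.shape)  # gradient of the quadratic plus noise
for _ in range(1000):
    x, state = adasam_step(x, noisy_grad, state)
print("final ||x||:", np.linalg.norm(x))
```

Note that the paper's analysis decouples the adaptive scaling from the current perturbed gradient via a delayed second-order momentum term (per the abstract); for brevity, the sketch above scales by the current-step second-moment estimate and omits bias correction.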
Pages: 506-519
Number of pages: 14