On the different regimes of stochastic gradient descent

Cited by: 6
Authors
Sclocchi, Antonio [1 ]
Wyart, Matthieu [1 ]
Affiliations
[1] Ecole Polytech Fed Lausanne, Inst Phys, CH-1015 Lausanne, Switzerland
Keywords
stochastic gradient descent; phase diagram; critical batch size; implicit bias
DOI
10.1073/pnas.2316301121
CLC Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification
07; 0710; 09
Abstract
Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data considered at each step, or batch size B, and the step size, or learning rate η. For small B and large η, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the "temperature" T ≡ η/B. Yet this description is observed to break down for sufficiently large batches B ≥ B*, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the B-η plane that separates three dynamical phases: i) a noise-dominated SGD governed by temperature, ii) a large-first-step-dominated SGD, and iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size B* separating regimes (i) and (ii) scales with the size P of the training set, with an exponent that characterizes the hardness of the classification problem.
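To make the role of the "temperature" T ≡ η/B concrete, the following minimal sketch (not the authors' code; the hinge loss, Gaussian inputs, and the particular η, B, and step-count values are assumptions) trains a teacher-student perceptron with plain SGD and compares two runs that use different batch sizes but the same T = η/B.

import numpy as np

rng = np.random.default_rng(0)

d, P = 50, 2000                                  # input dimension, training-set size
teacher = rng.standard_normal(d)
teacher /= np.linalg.norm(teacher)               # unit-norm teacher vector

X = rng.standard_normal((P, d)) / np.sqrt(d)     # Gaussian training inputs
y = np.sign(X @ teacher)                         # labels produced by the teacher

def sgd(eta, B, steps=5000):
    """Plain SGD on a hinge loss; the noise amplitude is set by T = eta / B."""
    w = rng.standard_normal(d) / np.sqrt(d)      # student initialization
    for _ in range(steps):
        idx = rng.choice(P, size=B, replace=False)    # sample a mini-batch
        margins = y[idx] * (X[idx] @ w)
        active = margins < 1                          # samples inside the hinge margin
        grad = -(y[idx][:, None] * X[idx])[active].sum(axis=0) / B
        w -= eta * grad
    return w

# Two runs at the same temperature T = eta / B = 0.01 but different batch sizes:
for eta, B in [(0.1, 10), (1.0, 100)]:
    w = sgd(eta, B)
    overlap = w @ teacher / np.linalg.norm(w)    # alignment of student with teacher
    print(f"eta={eta}, B={B}, T={eta / B:.3f}, teacher overlap={overlap:.3f}")

In the noise-dominated phase described in the abstract, runs at equal T behave similarly; the paper's point is that this equivalence fails once B exceeds the critical batch size B*, which grows with P.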
Pages: 8