On the different regimes of stochastic gradient descent

Cited by: 6
Authors
Sclocchi, Antonio [1 ]
Wyart, Matthieu [1 ]
Affiliations
[1] Ecole Polytech Fed Lausanne, Inst Phys, CH-1015 Lausanne, Switzerland
Keywords
stochastic gradient descent; phase diagram; critical batch size; implicit bias
DOI
10.1073/pnas.2316301121
CLC Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification
07; 0710; 09
Abstract
Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data considered at each step, or batch size B, and the step size, or learning rate η. For small B and large η, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the "temperature" T ≡ η/B. Yet this description is observed to break down for sufficiently large batches B ≥ B*, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the B-η plane that separates three dynamical phases: i) a noise-dominated SGD governed by temperature, ii) a large-first-step-dominated SGD, and iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size B* separating regimes (i) and (ii) scales with the size P of the training set, with an exponent that characterizes the hardness of the classification problem.
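To make the role of the "temperature" T ≡ η/B concrete, the following minimal sketch (not the authors' code; the hinge loss, Gaussian inputs, and the particular η, B, and step-count values are assumptions) trains a teacher-student perceptron with plain SGD and compares two runs that use different batch sizes but the same T = η/B.

import numpy as np

rng = np.random.default_rng(0)

d, P = 50, 2000                                  # input dimension, training-set size
teacher = rng.standard_normal(d)
teacher /= np.linalg.norm(teacher)               # unit-norm teacher vector

X = rng.standard_normal((P, d)) / np.sqrt(d)     # Gaussian training inputs
y = np.sign(X @ teacher)                         # labels produced by the teacher

def sgd(eta, B, steps=5000):
    """Plain SGD on a hinge loss; the noise amplitude is set by T = eta / B."""
    w = rng.standard_normal(d) / np.sqrt(d)      # student initialization
    for _ in range(steps):
        idx = rng.choice(P, size=B, replace=False)    # sample a mini-batch
        margins = y[idx] * (X[idx] @ w)
        active = margins < 1                          # samples inside the hinge margin
        grad = -(y[idx][:, None] * X[idx])[active].sum(axis=0) / B
        w -= eta * grad
    return w

# Two runs at the same temperature T = eta / B = 0.01 but different batch sizes:
for eta, B in [(0.1, 10), (1.0, 100)]:
    w = sgd(eta, B)
    overlap = w @ teacher / np.linalg.norm(w)    # alignment of student with teacher
    print(f"eta={eta}, B={B}, T={eta / B:.3f}, teacher overlap={overlap:.3f}")

In the noise-dominated phase described in the abstract, runs at equal T behave similarly; the paper's point is that this equivalence fails once B exceeds the critical batch size B*, which grows with P.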
Pages: 8