Robust Training of Neural Networks Using Scale Invariant Architectures

Cited: 0
Authors
Li, Zhiyuan [1 ,2 ]
Bhojanapalli, Srinadh [2 ]
Zaheer, Manzil [3 ]
Reddi, Sashank J. [2 ]
Kumar, Sanjiv [2 ]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
[2] Google Res New York, New York, NY 10011 USA
[3] Google DeepMind New York, New York, NY USA
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162 | 2022
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In contrast to SGD, adaptive gradient methods like ADAM allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises a fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer by proposing the following general recipe for robust and memory-efficient training: (1) modify the architecture to make it scale invariant, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm proportional to the weight norm multiplied by √(2λ/η), where η is the learning rate and λ is the weight decay. We show that this general approach is robust to rescaling of the parameters and the loss by proving that its convergence depends only logarithmically on the scale of initialization and loss, whereas standard SGD might not even converge for many initializations. Following our recipe, we design a scale invariant version of BERT, called SIBERT, which, when trained simply with vanilla SGD, achieves performance on downstream tasks comparable to BERT trained with adaptive methods like ADAM.
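The clipping rule in step (3) is straightforward to layer on top of a plain SGD update. Below is a minimal sketch, assuming a PyTorch model; the function name, the illustrative values of the learning rate η (`eta`) and weight decay λ (`lam`), and the choice to apply weight decay outside the clipped gradient are our own assumptions, not the authors' implementation.

```python
# Minimal sketch (an assumption, not the authors' code) of one SGD step from the recipe:
# clip the global gradient norm at weight_norm * sqrt(2*lam/eta), then apply
# SGD with weight decay. `model`, `eta`, and `lam` are illustrative placeholders.
import math
import torch

def sgd_step_with_relative_clipping(model, eta=0.1, lam=1e-4):
    params = [p for p in model.parameters() if p.grad is not None]
    # Global norms over all parameters, treated as one concatenated vector.
    weight_norm = math.sqrt(sum(p.detach().pow(2).sum().item() for p in params))
    grad_norm = math.sqrt(sum(p.grad.detach().pow(2).sum().item() for p in params))
    # Step (3): clipping threshold proportional to the weight norm times sqrt(2*lam/eta).
    threshold = weight_norm * math.sqrt(2.0 * lam / eta)
    scale = min(1.0, threshold / (grad_norm + 1e-12))
    with torch.no_grad():
        for p in params:
            # Step (2): SGD with weight decay, using the (possibly) clipped gradient.
            p.add_(scale * p.grad + lam * p, alpha=-eta)
```

Step (1), making the architecture itself scale invariant (as in SIBERT), is an architectural change and is not shown in this sketch.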
Pages: 29