Fine-Tuning Language Models with Just Forward Passes

Cited by: 0
Authors
Malladi, Sadhika [1 ]
Gao, Tianyu [1 ]
Nichani, Eshaan [1 ]
Damian, Alex [1 ]
Lee, Jason D. [1 ]
Chen, Danqi [1 ]
Arora, Sanjeev [1 ]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
U.S. National Science Foundation;
Keywords
STOCHASTIC-APPROXIMATION; PERTURBATION; OPTIMIZATION;
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
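The abstract describes a zeroth-order update computed in-place from two forward passes. The sketch below, written in PyTorch, illustrates one such step under the common approach of regenerating the perturbation noise from a shared random seed so it never has to be stored; the function names and hyperparameters (`mezo_step`, `eps`, `lr`) are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def mezo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    """One zeroth-order (SPSA-style) step, sketched from the abstract's description:
    perturb parameters in place, run two forward passes, and update in place."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Reseed so the same Gaussian direction z is regenerated instead of stored.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                       # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)                       # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                       # restore theta

        # Projected-gradient scalar from the two forward passes.
        grad_scale = (loss_plus - loss_minus) / (2 * eps)

        # Apply theta <- theta - lr * grad_scale * z, regenerating z from the seed.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_scale * z)

    return loss_plus
```

Because the perturbation direction is reconstructed from the seed and every modification is applied in place, the step keeps roughly the same memory footprint as inference, which is the property the abstract highlights.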
Pages: 38