Fine-Tuning Language Models with Just Forward Passes

Cited by: 0
Authors
Malladi, Sadhika [1]
Gao, Tianyu [1]
Nichani, Eshaan [1]
Damian, Alex [1]
Lee, Jason D. [1]
Chen, Danqi [1]
Arora, Sanjeev [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Source
Advances in Neural Information Processing Systems 36 (NeurIPS 2023) | 2023
Funding
U.S. National Science Foundation
Keywords
Stochastic approximation; Perturbation; Optimization
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion-parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
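To make the mechanism described above concrete, the following is a minimal sketch of a MeZO-style in-place ZO-SGD step, written in PyTorch. It is an illustrative approximation rather than the authors' released implementation: the names `zo_step` and `loss_fn`, the hyperparameter defaults, and the choice to regenerate the perturbation from a stored random seed instead of keeping it in memory are all assumptions made for this sketch.

```python
# Minimal sketch of a MeZO-style in-place ZO-SGD step (illustrative only,
# not the authors' released code). `loss_fn(model)` is a hypothetical
# callable that runs one forward pass and returns a scalar loss tensor.
import torch


@torch.no_grad()
def zo_step(model, loss_fn, lr=1e-6, eps=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]

    # Store only a seed; the perturbation z is regenerated on demand,
    # so memory stays at roughly the inference footprint.
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.add_(scale * eps * z)

    perturb(+1)                        # theta + eps * z
    loss_plus = loss_fn(model).item()
    perturb(-2)                        # theta - eps * z
    loss_minus = loss_fn(model).item()
    perturb(+1)                        # restore theta

    # SPSA-style projected gradient estimate: (L+ - L-) / (2 * eps) scales z.
    grad_scale = (loss_plus - loss_minus) / (2 * eps)

    # In-place SGD update, regenerating the same z from the stored seed.
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.add_(-lr * grad_scale * z)

    return loss_plus
```

Each step costs two forward passes plus an in-place update; the only state kept beyond the model parameters is an integer seed and two scalar losses, which is what allows the memory footprint to stay at the level of inference.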
Pages: 38