Fine-Tuning Language Models with Just Forward Passes

Cited by: 0
Authors
Malladi, Sadhika [1]
Gao, Tianyu [1]
Nichani, Eshaan [1]
Damian, Alex [1]
Lee, Jason D. [1]
Chen, Danqi [1]
Arora, Sanjeev [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Source
Advances in Neural Information Processing Systems 36 (NeurIPS 2023) | 2023
Funding
U.S. National Science Foundation
Keywords
Stochastic approximation; Perturbation; Optimization
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion-parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
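To make the mechanism described above concrete, the following is a minimal sketch of a MeZO-style in-place ZO-SGD step, written in PyTorch. It is an illustrative approximation rather than the authors' released implementation: the names `zo_step` and `loss_fn`, the hyperparameter defaults, and the choice to regenerate the perturbation from a stored random seed instead of keeping it in memory are all assumptions made for this sketch.

```python
# Minimal sketch of a MeZO-style in-place ZO-SGD step (illustrative only,
# not the authors' released code). `loss_fn(model)` is a hypothetical
# callable that runs one forward pass and returns a scalar loss tensor.
import torch


@torch.no_grad()
def zo_step(model, loss_fn, lr=1e-6, eps=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]

    # Store only a seed; the perturbation z is regenerated on demand,
    # so memory stays at roughly the inference footprint.
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.add_(scale * eps * z)

    perturb(+1)                        # theta + eps * z
    loss_plus = loss_fn(model).item()
    perturb(-2)                        # theta - eps * z
    loss_minus = loss_fn(model).item()
    perturb(+1)                        # restore theta

    # SPSA-style projected gradient estimate: (L+ - L-) / (2 * eps) scales z.
    grad_scale = (loss_plus - loss_minus) / (2 * eps)

    # In-place SGD update, regenerating the same z from the stored seed.
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.add_(-lr * grad_scale * z)

    return loss_plus
```

Each step costs two forward passes plus an in-place update; the only state kept beyond the model parameters is an integer seed and two scalar losses, which is what allows the memory footprint to stay at the level of inference.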
Pages: 38