Fine-Tuning Language Models with Just Forward Passes

Cited by: 0
Authors
Malladi, Sadhika [1 ]
Gao, Tianyu [1 ]
Nichani, Eshaan [1 ]
Damian, Alex [1 ]
Lee, Jason D. [1 ]
Chen, Danqi [1 ]
Arora, Sanjeev [1 ]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
U.S. National Science Foundation;
Keywords
STOCHASTIC-APPROXIMATION; PERTURBATION; OPTIMIZATION;
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
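The abstract describes a zeroth-order update computed in-place from two forward passes. The sketch below, written in PyTorch, illustrates one such step under the common approach of regenerating the perturbation noise from a shared random seed so it never has to be stored; the function names and hyperparameters (`mezo_step`, `eps`, `lr`) are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def mezo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    """One zeroth-order (SPSA-style) step, sketched from the abstract's description:
    perturb parameters in place, run two forward passes, and update in place."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Reseed so the same Gaussian direction z is regenerated instead of stored.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                       # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)                       # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                       # restore theta

        # Projected-gradient scalar from the two forward passes.
        grad_scale = (loss_plus - loss_minus) / (2 * eps)

        # Apply theta <- theta - lr * grad_scale * z, regenerating z from the seed.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_scale * z)

    return loss_plus
```

Because the perturbation direction is reconstructed from the seed and every modification is applied in place, the step keeps roughly the same memory footprint as inference, which is the property the abstract highlights.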
Pages: 38