Efficient Fine-Tuning of BERT Models on the Edge

Cited by: 16
Authors
Vucetic, Danilo [1 ]
Tayaranian, Mohammadreza [1 ]
Ziaeefard, Maryam [1 ]
Clark, James J. [1 ]
Meyer, Brett H. [1 ]
Gross, Warren J. [1 ]
Affiliations
[1] McGill Univ, Dept Elect & Comp Engn, Montreal, PQ, Canada
Source
2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22) | 2022
Keywords
Transformers; BERT; DistilBERT; NLP; Language Models; Efficient Transfer Learning; Efficient Fine-Tuning; Memory Efficiency; Time Efficiency; Edge Machine Learning;
DOI
10.1109/ISCAS48785.2022.9937567
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Resource-constrained devices are increasingly the deployment targets of machine learning applications. Static models, however, do not always suffice for dynamic environments. On-device training of models allows for quick adaptability to new scenarios. With the increasing size of deep neural networks, as seen with BERT and other natural language processing models, come increased resource requirements, namely memory, computation, energy, and time. Furthermore, training is far more resource intensive than inference. Resource-constrained on-device learning is thus doubly difficult, especially with large BERT-like models. By reducing the memory usage of fine-tuning, pre-trained BERT models can become efficient enough to fine-tune on resource-constrained devices. We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models that reduces the memory usage of activation maps during fine-tuning by avoiding unnecessary parameter updates. FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%. More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
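To make the parameter-freezing idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' FAR implementation: it freezes an illustrative subset of DistilBERT's feed-forward parameters before fine-tuning so that no gradients or optimizer state are maintained for them. The layer-selection rule and the use of the Hugging Face transformers library are assumptions for illustration; FAR's actual selection and reconfiguration policy is described in the paper.

import torch
from transformers import DistilBertForSequenceClassification

# Load a pre-trained DistilBERT with a 2-class head (e.g., for a CoLA-style task).
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Hypothetical freezing policy, for illustration only: disable gradients for the
# feed-forward sublayers of the first four Transformer blocks. Frozen parameters
# receive no updates, so no gradients or optimizer state are kept for them.
for name, param in model.named_parameters():
    if any(f"transformer.layer.{i}.ffn" in name for i in range(4)):
        param.requires_grad = False

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)

Fine-tuning then proceeds as usual on the reduced trainable set; the paper's method additionally decides which updates are unnecessary and reconfigures the model accordingly, which this sketch does not attempt to reproduce.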
Pages: 1838-1842
Page count: 5