ARC: A Layer Replacement Compression Method Based on Fine-Grained Self-Attention Distillation for Compressing Pre-Trained Language Models

Cited by: 0
Authors
Yu, Daohan [1]
Qiu, Liqing [1]
Affiliations
[1] Shandong Univ Sci & Technol, Qingdao 266590, Peoples R China
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2025, Vol. 9, No. 1
Keywords
Computational modeling; Training; Transformers; Task analysis; Accuracy; Vectors; Probability distribution; Knowledge distillation; model compression; natural language processing; transfer learning
DOI
10.1109/TETCI.2024.3418837
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The primary objective of model compression is to preserve the performance of the original model while reducing its size as much as possible. Knowledge distillation has become the mainstream approach to model compression owing to its strong performance. However, current knowledge distillation methods designed for medium and small pre-trained models struggle to extract knowledge effectively from large pre-trained models, while methods targeting large pre-trained models have difficulty compressing them to a small scale. Therefore, this paper proposes a new model compression method called Attention-based Replacement Compression (ARC), which introduces random layer replacement on top of fine-grained self-attention distillation. In the pre-training distillation stage, the method first captures the important features of the original model through fine-grained self-attention distillation, and distilling from the upper layers of the large teacher model extracts additional information. In the fine-tuning compression stage, one-to-one random replacement of Transformer layers then fully exploits the hidden knowledge of the large pre-trained model. Compared with other, more complex compression methods, ARC not only simplifies the training process of model compression but also broadens the applicability of the compressed model. This paper compares knowledge distillation methods for pre-trained models of different sizes on the GLUE benchmark. Experimental results demonstrate that the proposed method achieves significant improvements across different parameter scales, especially in accuracy and inference speed.
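The two-stage procedure summarized in the abstract can be pictured with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions, not the authors' implementation: the KL-divergence attention-matching loss, the one-to-one pairing of student layers with the teacher's upper layers, and the fixed replacement probability `replace_prob` are hypothetical choices introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attns, teacher_attns, eps=1e-9):
    """Fine-grained self-attention distillation (sketch).

    Each element of `student_attns` / `teacher_attns` is an attention
    probability tensor of shape (batch, heads, seq_len, seq_len).
    The teacher tensors are assumed to come from its upper layers,
    paired one-to-one with the student layers.
    """
    loss = 0.0
    for s_attn, t_attn in zip(student_attns, teacher_attns):
        # KL divergence between per-head attention distributions
        loss = loss + F.kl_div(
            (s_attn + eps).log(),  # student log-probabilities
            t_attn,                # teacher probabilities
            reduction="batchmean",
        )
    return loss / len(student_attns)

def forward_with_random_replacement(hidden_states, student_layers,
                                    teacher_layers, replace_prob=0.5,
                                    training=True):
    """One-to-one random Transformer-layer replacement (sketch).

    During fine-tuning compression, each student layer is swapped with
    its paired teacher layer with probability `replace_prob`, so the
    compact layer learns to stand in for the teacher block it replaces.
    At inference (`training=False`) only the student layers are used.
    """
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        use_teacher = training and torch.rand(()).item() < replace_prob
        layer = t_layer if use_teacher else s_layer
        hidden_states = layer(hidden_states)
    return hidden_states
```

In this sketch the replacement direction and probability are fixed; the paper's actual layer mapping and schedule may differ. The layer arguments can be any modules that map a hidden-state tensor to a tensor of the same shape, such as `torch.nn.TransformerEncoderLayer`.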
Pages: 848-860
Number of pages: 13