Joint Dual Feature Distillation and Gradient Progressive Pruning for BERT compression

Cited by: 0
Authors
Zhang, Zhou [1 ]
Lu, Yang [1 ,2 ]
Wang, Tengfei [1 ]
Wei, Xing [1 ,3 ]
Wei, Zhen [1 ,3 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[2] Anhui Mine IOT & Secur Monitoring Technol Key Lab, Hefei 230088, Peoples R China
[3] Hefei Univ Technol, Intelligent Mfg Inst, Hefei 230009, Peoples R China
Keywords
Pre-trained model compression; Structured pruning; Knowledge distillation
DOI
10.1016/j.neunet.2024.106533
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The increasing size of pre-trained language models has led to growing interest in model compression. Pruning and distillation are the primary methods employed to compress these models. Existing pruning and distillation methods are effective at maintaining model accuracy while reducing model size, but they have limitations. For instance, pruning often becomes suboptimal and biased when the discrete pruning decision is relaxed into a continuous optimization problem. Distillation relies primarily on one-to-one layer mappings for knowledge transfer, which underutilizes the rich knowledge in the teacher. We therefore propose a joint pruning and distillation method for automatic pruning of pre-trained language models. Specifically, we first propose Gradient Progressive Pruning (GPP), which achieves a smooth transition of indicator-vector values from real-valued to binary, progressively converging the indicator values of unimportant units to zero before the end of the search phase. This effectively overcomes the limitations of traditional pruning methods while supporting compression at higher sparsity. In addition, we propose Dual Feature Distillation (DFD). DFD adaptively fuses teacher features globally and student features locally, and then uses the dual features of global teacher features and local student features for knowledge distillation. This realizes a "preview-review" mechanism that better extracts useful information from multi-level teacher features and transfers it to the student. Comparative experiments on the GLUE benchmark and ablation studies indicate that our method outperforms other state-of-the-art methods.
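To make the indicator-vector idea in the abstract concrete, below is a minimal PyTorch sketch of a progressive real-to-binary gate for prunable units (e.g., attention heads). The class name `ProgressiveIndicator`, the sigmoid parameterization, and the linear temperature schedule are illustrative assumptions, not the paper's exact GPP formulation; the DFD feature-fusion component is not sketched here.

```python
# Sketch of a GPP-style indicator gate (assumed formulation, for illustration only).
import torch
import torch.nn as nn


class ProgressiveIndicator(nn.Module):
    """Real-valued indicator vector driven progressively toward {0, 1}.

    One learnable scalar gate per prunable unit. Early in the search phase the
    gates are soft (near-continuous); as training progresses the sigmoid
    temperature is annealed, so gates with negative logits converge toward 0
    and gates with positive logits toward 1 before the search phase ends.
    """

    def __init__(self, num_units: int, total_steps: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_units))  # learnable importance scores
        self.total_steps = total_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def forward(self) -> torch.Tensor:
        # Anneal temperature from 1.0 down to 0.01 (assumed linear schedule).
        progress = (self.step.float() / self.total_steps).clamp(0.0, 1.0)
        temperature = 1.0 - 0.99 * progress
        gates = torch.sigmoid(self.logits / temperature)
        if self.training:
            self.step += 1
        return gates


# Usage sketch: multiply each attention head's output by its gate, add a
# sparsity penalty on the gates to the task loss, and remove units whose
# gate has converged to ~0 at the end of the search phase.
gate = ProgressiveIndicator(num_units=12, total_steps=10_000)
head_mask = gate()               # shape (12,); saturates toward {0, 1} as logits are learned
sparsity_loss = head_mask.sum()  # pushes unimportant heads' gates toward 0
```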
Pages: 11