Joint Dual Feature Distillation and Gradient Progressive Pruning for BERT Compression

Times Cited: 0
Authors
Zhang, Zhou [1 ]
Lu, Yang [1 ,2 ]
Wang, Tengfei [1 ]
Wei, Xing [1 ,3 ]
Wei, Zhen [1 ,3 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[2] Anhui Mine IOT & Secur Monitoring Technol Key Lab, Hefei 230088, Peoples R China
[3] Hefei Univ Technol, Intelligent Mfg Inst, Hefei 230009, Peoples R China
Keywords
Pre-trained model compression; Structured pruning; Knowledge distillation
DOI
10.1016/j.neunet.2024.106533
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The increasing size of pre-trained language models has led to growing interest in model compression. Pruning and distillation are the primary methods for compressing these models, and existing approaches are effective at preserving accuracy while reducing model size. However, they have limitations. Pruning methods typically relax the discrete pruning decision into a continuous optimization problem, which makes the result suboptimal and biased. Distillation relies primarily on one-to-one layer mappings for knowledge transfer, which underutilizes the rich knowledge in the teacher. We therefore propose a joint pruning and distillation method for automatic pruning of pre-trained language models. Specifically, we first propose Gradient Progressive Pruning (GPP), which achieves a smooth transition of indicator vector values from real-valued to binary by progressively driving the indicator values of unimportant units to zero before the end of the search phase. This overcomes the limitations of traditional pruning methods while supporting compression at higher sparsity. In addition, we propose Dual Feature Distillation (DFD). DFD adaptively fuses teacher features globally and student features locally, and then uses the resulting global teacher features and local student features for knowledge distillation. This realizes a "preview-review" mechanism that better extracts useful information from multi-level teacher representations and transfers it to the student. Comparative experiments on the GLUE benchmark and ablation studies show that our method outperforms other state-of-the-art methods.
Pages: 11
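The abstract describes two algorithmic ideas: pruning indicators that anneal from real-valued to binary (GPP), and distillation between globally fused teacher features and locally fused student features (DFD). The following is a minimal PyTorch sketch of these two ideas only; the annealing schedule, the straight-through estimator, the mean-based fusions, and all names and hyperparameters below are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch, assuming a sigmoid-temperature annealing schedule and
# simple mean fusion; these choices are NOT taken from the paper.
import torch
import torch.nn.functional as F


class ProgressiveIndicator(torch.nn.Module):
    """Learnable pruning indicators pushed from real values toward {0, 1}
    as the search phase progresses (hypothetical schedule)."""

    def __init__(self, num_units: int):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(num_units))

    def forward(self, progress: float) -> torch.Tensor:
        # progress in [0, 1]: sharpen the sigmoid so indicators of
        # unimportant units converge to 0 before the search phase ends.
        temperature = max(1.0 - progress, 1e-2)          # assumed annealing
        soft = torch.sigmoid(self.logits / temperature)  # real-valued mask
        hard = (soft > 0.5).float()                      # binary mask
        # Straight-through estimator: binary mask in the forward pass,
        # gradients flow through the soft mask.
        return hard + (soft - soft.detach())


def dual_feature_distillation_loss(teacher_feats, student_feats):
    """Toy stand-in for DFD: fuse all teacher layers globally (mean over
    layers, an assumption) and fuse neighbouring student layers locally,
    then match the two fused representations with an MSE loss."""
    global_teacher = torch.stack(teacher_feats).mean(dim=0)
    local_student = torch.stack(
        [(student_feats[i] + student_feats[i + 1]) / 2
         for i in range(len(student_feats) - 1)]
    ).mean(dim=0)
    return F.mse_loss(local_student, global_teacher)


if __name__ == "__main__":
    mask = ProgressiveIndicator(num_units=12)(progress=0.9)
    t = [torch.randn(4, 768) for _ in range(12)]   # fake teacher features
    s = [torch.randn(4, 768) for _ in range(6)]    # fake student features
    print(mask.shape, dual_feature_distillation_loss(t, s).item())
```

In this sketch the mask would multiply attention heads or hidden units during the search phase, and the distillation loss would be added to the task loss; the paper's adaptive fusion and "preview-review" weighting are not reproduced here.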