Joint Dual Feature Distillation and Gradient Progressive Pruning for BERT Compression

Times Cited: 0
Authors
Zhang, Zhou [1 ]
Lu, Yang [1 ,2 ]
Wang, Tengfei [1 ]
Wei, Xing [1 ,3 ]
Wei, Zhen [1 ,3 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[2] Anhui Mine IOT & Secur Monitoring Technol Key Lab, Hefei 230088, Peoples R China
[3] Hefei Univ Technol, Intelligent Mfg Inst, Hefei 230009, Peoples R China
Keywords
Pre-trained model compression; Structured pruning; Knowledge distillation
DOI
10.1016/j.neunet.2024.106533
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The increasing size of pre-trained language models has led to growing interest in model compression. Pruning and distillation are the primary methods for compressing these models, and existing approaches are effective at preserving accuracy while reducing model size. However, they have limitations. Pruning methods typically relax the discrete pruning decision into a continuous optimization problem, which makes the result suboptimal and biased. Distillation relies primarily on one-to-one layer mappings for knowledge transfer, which underutilizes the rich knowledge in the teacher. We therefore propose a joint pruning and distillation method for automatic pruning of pre-trained language models. Specifically, we first propose Gradient Progressive Pruning (GPP), which achieves a smooth transition of indicator vector values from real-valued to binary by progressively driving the indicator values of unimportant units to zero before the end of the search phase. This overcomes the limitations of traditional pruning methods while supporting compression at higher sparsity. In addition, we propose Dual Feature Distillation (DFD). DFD adaptively fuses teacher features globally and student features locally, and then uses the resulting global teacher features and local student features for knowledge distillation. This realizes a "preview-review" mechanism that better extracts useful information from multi-level teacher representations and transfers it to the student. Comparative experiments on the GLUE benchmark and ablation studies show that our method outperforms other state-of-the-art methods.
Pages: 11
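The abstract describes two algorithmic ideas: pruning indicators that anneal from real-valued to binary (GPP), and distillation between globally fused teacher features and locally fused student features (DFD). The following is a minimal PyTorch sketch of these two ideas only; the annealing schedule, the straight-through estimator, the mean-based fusions, and all names and hyperparameters below are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch, assuming a sigmoid-temperature annealing schedule and
# simple mean fusion; these choices are NOT taken from the paper.
import torch
import torch.nn.functional as F


class ProgressiveIndicator(torch.nn.Module):
    """Learnable pruning indicators pushed from real values toward {0, 1}
    as the search phase progresses (hypothetical schedule)."""

    def __init__(self, num_units: int):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(num_units))

    def forward(self, progress: float) -> torch.Tensor:
        # progress in [0, 1]: sharpen the sigmoid so indicators of
        # unimportant units converge to 0 before the search phase ends.
        temperature = max(1.0 - progress, 1e-2)          # assumed annealing
        soft = torch.sigmoid(self.logits / temperature)  # real-valued mask
        hard = (soft > 0.5).float()                      # binary mask
        # Straight-through estimator: binary mask in the forward pass,
        # gradients flow through the soft mask.
        return hard + (soft - soft.detach())


def dual_feature_distillation_loss(teacher_feats, student_feats):
    """Toy stand-in for DFD: fuse all teacher layers globally (mean over
    layers, an assumption) and fuse neighbouring student layers locally,
    then match the two fused representations with an MSE loss."""
    global_teacher = torch.stack(teacher_feats).mean(dim=0)
    local_student = torch.stack(
        [(student_feats[i] + student_feats[i + 1]) / 2
         for i in range(len(student_feats) - 1)]
    ).mean(dim=0)
    return F.mse_loss(local_student, global_teacher)


if __name__ == "__main__":
    mask = ProgressiveIndicator(num_units=12)(progress=0.9)
    t = [torch.randn(4, 768) for _ in range(12)]   # fake teacher features
    s = [torch.randn(4, 768) for _ in range(6)]    # fake student features
    print(mask.shape, dual_feature_distillation_loss(t, s).item())
```

In this sketch the mask would multiply attention heads or hidden units during the search phase, and the distillation loss would be added to the task loss; the paper's adaptive fusion and "preview-review" weighting are not reproduced here.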