Model compression through distillation with cross-layer integrated guidance at word level

Cited by: 0
Authors
Li, Guiyu [1]
Zheng, Shang [1]
Zou, Haitao [1]
Yu, Hualong [1]
Gao, Shang [1]
Affiliation
[1] Jiangsu Univ Sci & Technol, Sch Comp, Zhenjiang 212100, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Software engineering; Knowledge distillation; Model compression; Word-level association; Cross-layer connection;
DOI
10.1016/j.neucom.2024.129162
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In many academic and applied domains, including software engineering, lightweight applications can be enabled by knowledge distillation: insights are transferred from a teacher model to a student model, which gradually learns to replicate the teacher's behavior, thereby achieving model compression and acceleration. Nonetheless, because of the capacity gap between teacher and student models, the mixture of knowledge types in hidden representations, and the varying proportion of knowledge in each layer's hidden representations, there is still room for improvement in knowledge distillation. This paper proposes Cross-layer Integration of Word-level Association (CI-WA) knowledge distillation. Firstly, CI-WA introduces an extractor built on dynamic sparse attention, which extracts task-related word-level associations from the hidden vectors at each layer, mitigating the influence of task-irrelevant information. Secondly, cross-layer connections are introduced into the distillation process, improving the student model's performance by jointly exploiting high-level and low-level features. Finally, the proposed method is validated on two tasks: natural language understanding and language modeling. Experimental results show that the proposed method surpasses state-of-the-art techniques, and that language modeling offers a new perspective for knowledge distillation.
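The abstract outlines two mechanisms: a dynamic-sparse-attention extractor that turns each layer's hidden states into task-related word-level associations, and cross-layer connections that let high-level and low-level features jointly guide the student. The following minimal PyTorch sketch shows one plausible reading, not the paper's implementation: the class and function names, the top-k sparsification, the uniform layer weighting, and the MSE matching objective are all illustrative assumptions.

import torch
import torch.nn.functional as F
from torch import nn

class SparseWordAssociationExtractor(nn.Module):
    # Hypothetical extractor: scores token-token associations with
    # scaled dot-product attention, keeps only the top-k associations
    # per token (the "dynamic sparse" step), and suppresses the rest,
    # so task-irrelevant token pairs contribute nothing.
    def __init__(self, hidden_dim: int, k: int = 8):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)
        self.k = k

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) from one transformer layer
        q, k = self.query(hidden), self.key(hidden)
        scores = q @ k.transpose(-1, -2) / hidden.size(-1) ** 0.5
        top = torch.topk(scores, min(self.k, scores.size(-1)), dim=-1)
        masked = torch.full_like(scores, float("-inf"))
        masked.scatter_(-1, top.indices, top.values)
        # (batch, seq_len, seq_len) word-level association matrix;
        # its shape does not depend on hidden_dim, so teacher and
        # student matrices are directly comparable.
        return torch.softmax(masked, dim=-1)

def cross_layer_distill_loss(student_assoc, teacher_assoc):
    # Illustrative cross-layer objective: each student layer is matched
    # against its aligned teacher layer and all lower teacher layers,
    # so low-level and high-level features guide the student jointly.
    # Uniform weighting across layer pairs is an assumption.
    loss, pairs = torch.zeros(()), 0
    for i, s in enumerate(student_assoc):
        for t in teacher_assoc[: i + 1]:
            loss = loss + F.mse_loss(s, t)
            pairs += 1
    return loss / pairs

# Toy usage: teacher and student use different hidden sizes, yet their
# (seq_len x seq_len) association matrices can still be compared.
student_ex = SparseWordAssociationExtractor(hidden_dim=64, k=4)
teacher_ex = SparseWordAssociationExtractor(hidden_dim=128, k=4)
student_assoc = [student_ex(torch.randn(2, 16, 64)) for _ in range(4)]
teacher_assoc = [teacher_ex(torch.randn(2, 16, 128)).detach() for _ in range(4)]
print(cross_layer_distill_loss(student_assoc, teacher_assoc))

Because the association matrices are seq_len x seq_len rather than sized by the hidden dimension, this kind of matching does not require teacher and student to share a hidden size, which is one plausible way to sidestep the capacity gap the abstract mentions.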
Pages: 15