Model compression through distillation with cross-layer integrated guidance at word level

Cited by: 0
Authors
Li, Guiyu [1]
Zheng, Shang [1]
Zou, Haitao [1]
Yu, Hualong [1]
Gao, Shang [1]
Affiliation
[1] Jiangsu Univ Sci & Technol, Sch Comp, Zhenjiang 212100, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Software engineering; Knowledge distillation; Model compression; Word-level association; Cross-layer connection;
DOI
10.1016/j.neucom.2024.129162
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In many academic and applied domains, including software engineering, lightweight applications can be enabled by knowledge distillation: insights are transferred from a teacher model to a student model, which gradually learns to replicate the teacher's behavior, thereby achieving model compression and acceleration. Nonetheless, because of the capacity gap between teacher and student models, the mixture of knowledge types in hidden representations, and the varying proportion of knowledge in each layer's hidden representations, there is still room for improvement in knowledge distillation. This paper proposes Cross-layer Integration of Word-level Association (CI-WA) knowledge distillation. Firstly, CI-WA introduces an extractor built on dynamic sparse attention, which extracts task-related word-level associations from the hidden vectors at each layer, mitigating the influence of task-irrelevant information. Secondly, cross-layer connections are introduced into the distillation process, improving the student model's performance by jointly exploiting high-level and low-level features. Finally, the proposed method is validated on two tasks: natural language understanding and language modeling. Experimental results show that the proposed method surpasses state-of-the-art techniques, and that language modeling offers a new perspective for knowledge distillation.
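The abstract outlines two mechanisms: a dynamic-sparse-attention extractor that turns each layer's hidden states into task-related word-level associations, and cross-layer connections that let high-level and low-level features jointly guide the student. The following minimal PyTorch sketch shows one plausible reading, not the paper's implementation: the class and function names, the top-k sparsification, the uniform layer weighting, and the MSE matching objective are all illustrative assumptions.

import torch
import torch.nn.functional as F
from torch import nn

class SparseWordAssociationExtractor(nn.Module):
    # Hypothetical extractor: scores token-token associations with
    # scaled dot-product attention, keeps only the top-k associations
    # per token (the "dynamic sparse" step), and suppresses the rest,
    # so task-irrelevant token pairs contribute nothing.
    def __init__(self, hidden_dim: int, k: int = 8):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)
        self.k = k

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) from one transformer layer
        q, k = self.query(hidden), self.key(hidden)
        scores = q @ k.transpose(-1, -2) / hidden.size(-1) ** 0.5
        top = torch.topk(scores, min(self.k, scores.size(-1)), dim=-1)
        masked = torch.full_like(scores, float("-inf"))
        masked.scatter_(-1, top.indices, top.values)
        # (batch, seq_len, seq_len) word-level association matrix;
        # its shape does not depend on hidden_dim, so teacher and
        # student matrices are directly comparable.
        return torch.softmax(masked, dim=-1)

def cross_layer_distill_loss(student_assoc, teacher_assoc):
    # Illustrative cross-layer objective: each student layer is matched
    # against its aligned teacher layer and all lower teacher layers,
    # so low-level and high-level features guide the student jointly.
    # Uniform weighting across layer pairs is an assumption.
    loss, pairs = torch.zeros(()), 0
    for i, s in enumerate(student_assoc):
        for t in teacher_assoc[: i + 1]:
            loss = loss + F.mse_loss(s, t)
            pairs += 1
    return loss / pairs

# Toy usage: teacher and student use different hidden sizes, yet their
# (seq_len x seq_len) association matrices can still be compared.
student_ex = SparseWordAssociationExtractor(hidden_dim=64, k=4)
teacher_ex = SparseWordAssociationExtractor(hidden_dim=128, k=4)
student_assoc = [student_ex(torch.randn(2, 16, 64)) for _ in range(4)]
teacher_assoc = [teacher_ex(torch.randn(2, 16, 128)).detach() for _ in range(4)]
print(cross_layer_distill_loss(student_assoc, teacher_assoc))

Because the association matrices are seq_len x seq_len rather than sized by the hidden dimension, this kind of matching does not require teacher and student to share a hidden size, which is one plausible way to sidestep the capacity gap the abstract mentions.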
Pages: 15