Parameter-Efficient and Student-Friendly Knowledge Distillation

Cited by: 15

Authors:

Rao, Jun [1,2]

Meng, Xv [2]

Ding, Liang [3]

Qi, Shuhan [2,4]

Liu, Xuebo [2]

Zhang, Min [2]

Tao, Dacheng [3]

Affiliations:

[1] JD Explore Acad, Beijing, Peoples R China

[2] Harbin Inst Technol, Shenzhen 518055, Peoples R China

[3] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia

[4] Guangdong Prov Key Lab Novel Secur Intelligence Te, Shenzhen 518000, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Funding:

National Natural Science Foundation of China;

Keywords:

Training; Smoothing methods; Knowledge transfer; Data models; Adaptation models; Predictive models; Knowledge engineering; Knowledge distillation; parameter-efficient; image classification;

DOI:

10.1109/TMM.2023.3321480

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Pre-trained models are frequently employed in multimodal learning. However, these models have many parameters and require considerable effort to fine-tune on downstream tasks. Knowledge distillation (KD) transfers knowledge from such a pre-trained teacher model to a smaller student through the teacher's soft labels, with the teacher's parameters fully or partially fixed during training. Recent studies show that this setup may hinder knowledge transfer because of the mismatch in model capacity. To alleviate the mismatch, several methods that smooth the teacher network's soft labels have been proposed, including temperature adjustment, label smoothing, and teacher-student joint training (online distillation). However, these methods rarely explain how smoothed soft labels improve KD performance. The main contributions of our work are the discovery, analysis, and validation of the effect of the smoothed soft label, and PESF-KD, a less time-consuming method that adaptively transfers the pre-trained teacher's knowledge by tuning the teacher network's soft labels. Technically, we first mathematically formulate the mismatch as the sharpness gap between the teacher's and the student's predictive distributions, and show that this gap can be narrowed by giving the soft label appropriate smoothness. We then introduce an adapter module for the teacher and update only the adapter to obtain soft labels with appropriate smoothness. Experiments on various benchmarks, including CV and NLP, show that PESF-KD significantly reduces training cost while achieving competitive results compared with advanced online distillation methods.
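
As an illustration of the soft-label smoothing the abstract refers to, the following minimal sketch shows how a temperature parameter smooths a teacher's predictive distribution and how a standard temperature-scaled KD loss is computed. It is not the PESF-KD adapter and not the paper's formulation: the example logits, the function names, and the use of entropy as a rough proxy for smoothness are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives a smoother distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy, used here only as an assumed proxy for smoothness."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits, temperature):
    """Classic soft-label KD loss: T^2 * KL(teacher_T || student_T)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Hypothetical logits: a confident (sharp) teacher and a less confident student.
teacher_logits = [8.0, 2.0, 1.0]
student_logits = [3.0, 2.0, 1.5]

sharp = softmax(teacher_logits, temperature=1.0)
smooth = softmax(teacher_logits, temperature=4.0)

print("teacher entropy at T=1:", entropy(sharp))
print("teacher entropy at T=4:", entropy(smooth))
print("KD loss at T=4:", kd_loss(teacher_logits, student_logits, 4.0))
```

In this sketch the temperature is a fixed hand-set value; the abstract instead describes obtaining appropriately smooth soft labels by training a small adapter on the teacher while the rest of the teacher stays frozen.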


Pages: 4230-4241

Page count: 12

