Parameter-Efficient and Student-Friendly Knowledge Distillation

Cited by: 15
Authors
Rao, Jun [1 ,2 ]
Meng, Xv [2 ]
Ding, Liang [3 ]
Qi, Shuhan [2 ,4 ]
Liu, Xuebo [2 ]
Zhang, Min [2 ]
Tao, Dacheng [3 ]
Affiliations
[1] JD Explore Acad, Beijing, Peoples R China
[2] Harbin Inst Technol, Shenzhen 518055, Peoples R China
[3] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
[4] Guangdong Prov Key Lab Novel Secur Intelligence Te, Shenzhen 518000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Smoothing methods; Knowledge transfer; Data models; Adaptation models; Predictive models; Knowledge engineering; Knowledge distillation; parameter-efficient; image classification;
DOI
10.1109/TMM.2023.3321480
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Pre-trained models are frequently employed in multimodal learning. However, these models have a large number of parameters and are costly to fine-tune for downstream tasks. Knowledge distillation (KD) transfers knowledge from a pre-trained teacher model to a smaller student through the teacher's soft labels, with the teacher's parameters fully or partially fixed during training. Recent studies show that this setup can hinder knowledge transfer because the capacities of the two models are mismatched. To alleviate the mismatch, several ways of smoothing the teacher's soft labels have been proposed: adjusting the temperature parameter, label smoothing, and teacher-student joint training (online distillation). However, these methods rarely explain how smoothed soft labels improve KD performance. The main contributions of our work are (1) the discovery, analysis, and validation of the effect of smoothed soft labels, and (2) PESF-KD, a less time-consuming method that transfers the pre-trained teacher's knowledge by adaptively tuning the teacher network's soft labels. Technically, we first formulate the mismatch mathematically as the sharpness gap between the teacher's and the student's predictive distributions, and show that this gap can be narrowed by giving the soft labels appropriate smoothness. We then introduce an adapter module for the teacher and update only the adapter to obtain soft labels with appropriate smoothness. Experiments on various benchmarks, including computer vision (CV) and natural language processing (NLP) tasks, show that PESF-KD significantly reduces training cost while achieving competitive results compared with advanced online distillation methods.
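As a rough illustration of the soft-label smoothing the abstract refers to, the following sketch shows a generic temperature-scaled distillation loss. It is not the PESF-KD method or its adapter module. The function names, the example logits, and the use of entropy as a smoothness proxy are assumptions made only for this illustration; the paper's own sharpness definition is not given in this record.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a smoother distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy, used here only as an assumed proxy for smoothness."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Generic KD loss: T^2 * KL(teacher soft label || student prediction)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits: a confident (sharp) teacher and a less confident student.
teacher_logits = [8.0, 2.0, 0.5]
student_logits = [3.0, 1.5, 1.0]

for t in (1.0, 4.0):
    soft_label = softmax(teacher_logits, t)
    print(f"T={t}: teacher entropy={entropy(soft_label):.4f}, "
          f"KD loss={kd_loss(teacher_logits, student_logits, t):.4f}")
```

Raising the temperature is one fixed, hand-tuned way to smooth the teacher's soft label; the abstract describes learning the appropriate smoothness through an adapter instead, which this sketch does not attempt.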
Pages: 4230-4241
Page count: 12
Related papers
51 in total
  • [1] Aghajanyan A., 2021, P 59 ANN M ASS COMPU, V1, P7319
  • [2] Distill on the Go: Online knowledge distillation in self-supervised learning
    Bhat, Prashant
    Arani, Elahe
    Zonooz, Bahram
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 2672 - 2681
  • [3] Chandrasegaran K., 2022, PMLR, P2890
  • [4] Chen DF, 2020, AAAI CONF ARTIF INTE, V34, P3430
  • [5] Distilling Knowledge via Knowledge Review
    Chen, Pengguang
    Liu, Shu
    Zhao, Hengshuang
    Jia, Jiaya
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 5006 - 5015
  • [6] On the Efficacy of Knowledge Distillation
    Cho, Jang Hyun
    Hariharan, Bharath
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4793 - 4801
  • [7] Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
  • [8] Knowledge Distillation: A Survey
    Gou, Jianping
    Yu, Baosheng
    Maybank, Stephen J.
    Tao, Dacheng
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2021, 129 (06) : 1789 - 1819
  • [9] Guo J., 2022, OpenReview preprint
  • [10] Guo QS, 2020, PROC CVPR IEEE, P11017, DOI 10.1109/CVPR42600.2020.01103