Parameter-Efficient and Student-Friendly Knowledge Distillation

Cited by: 15
Authors
Rao, Jun [1, 2]
Meng, Xv [2]
Ding, Liang [3]
Qi, Shuhan [2, 4]
Liu, Xuebo [2]
Zhang, Min [2]
Tao, Dacheng [3]
Affiliations
[1] JD Explore Acad, Beijing, Peoples R China
[2] Harbin Inst Technol, Shenzhen 518055, Peoples R China
[3] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
[4] Guangdong Prov Key Lab Novel Secur Intelligence Te, Shenzhen 518000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Smoothing methods; Knowledge transfer; Data models; Adaptation models; Predictive models; Knowledge engineering; Knowledge distillation; parameter-efficient; image classification;
DOI
10.1109/TMM.2023.3321480
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Pre-trained models are frequently employed in multimodal learning. However, these models have a large number of parameters and require considerable effort to fine-tune on downstream tasks. Knowledge distillation (KD) transfers knowledge from such a pre-trained teacher model to a smaller student through the teacher's soft labels, with the teacher's parameters fully or partially fixed during training. Recent studies show that this setup can hinder knowledge transfer because of the mismatch in model capacity. To alleviate the mismatch, methods that smooth the teacher network's soft labels have been proposed, including temperature adjustment, label smoothing, and teacher-student joint training (online distillation). However, these methods rarely explain how smoothed soft labels improve KD performance. The main contributions of our work are the discovery, analysis, and validation of the effect of smoothed soft labels, together with PESF-KD, a less time-consuming and adaptive method for transferring the pre-trained teacher's knowledge by adaptively tuning the teacher network's soft labels. Technically, we first formulate the mismatch mathematically as the sharpness gap between the teacher's and the student's predictive distributions, and show that this gap can be narrowed by giving the soft labels an appropriate degree of smoothness. We then introduce an adapter module for the teacher and update only the adapter to obtain soft labels with appropriate smoothness. Experiments on various benchmarks, including CV and NLP tasks, show that PESF-KD significantly reduces training cost while obtaining results competitive with advanced online distillation methods.
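The role of temperature in smoothing a teacher's soft labels can be illustrated with a minimal sketch. It is not the PESF-KD implementation: the logits are hypothetical, the function names are illustrative, the maximum class probability is used only as an informal proxy for sharpness (not the paper's formal definition), and the loss is the standard temperature-scaled KD objective with the common T^2 factor.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives a smoother distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits, temperature):
    """Standard soft-label KD loss: T^2 * KL(teacher || student) at temperature T."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Hypothetical logits: a confident (sharp) teacher and a less confident student.
teacher_logits = [8.0, 2.0, 1.0]
student_logits = [3.0, 2.0, 1.0]

sharp = softmax(teacher_logits, temperature=1.0)
smooth = softmax(teacher_logits, temperature=4.0)

# Informal sharpness proxy: the probability assigned to the top class.
print("teacher top-class prob at T=1:", max(sharp))
print("teacher top-class prob at T=4:", max(smooth))
print("KD loss at T=4:", kd_loss(teacher_logits, student_logits, 4.0))
```

Raising the temperature lowers the teacher's top-class probability, which brings its distribution closer in sharpness to that of a lower-capacity student. A fixed temperature is a manual choice, whereas the abstract describes learning the appropriate smoothness through an adapter on the teacher.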
Pages: 4230 - 4241
Number of pages: 12