DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy

Cited by: 2
Authors
Cheng, Xin [1 ]
Zhang, Zhiqiang [2 ]
Weng, Wei [3 ]
Yu, Wenxin [2 ]
Zhou, Jinjia [1 ]
Affiliations
[1] Hosei Univ, Grad Sch Sci & Engn, Tokyo 1848584, Japan
[2] Southwest Univ Sci & Technol, Sch Sci & Technol, Mianyang 621010, Peoples R China
[3] Kanazawa Univ, Inst Liberal Arts & Sci, Kanazawa 9201192, Japan
Keywords
multi-teacher knowledge distillation; image classification; entropy; deep learning;
DOI
10.3390/math12111672
Chinese Library Classification (CLC) number
O1 [Mathematics];
Subject classification codes
0701; 070101;
Abstract
The complexity of deep neural network (DNN) models severely limits their deployment on devices with limited computing and storage resources. Knowledge distillation (KD) is an attractive model compression technique that can effectively alleviate this problem. Multi-teacher knowledge distillation (MKD) aims to leverage the valuable and diverse knowledge distilled from multiple teacher networks to improve the performance of the student network. Existing approaches typically fuse the knowledge from multiple teachers with simple methods such as averaging the prediction logits, or with sub-optimal weighting strategies. However, these techniques cannot fully reflect the importance of each teacher and may even mislead the student's learning. To address this issue, we propose a novel Decoupled Multi-Teacher Knowledge Distillation based on Entropy (DE-MKD). DE-MKD decouples the vanilla knowledge distillation loss and assigns each teacher an adaptive weight, derived from the entropy of its predictions, to reflect its importance. Furthermore, we extend the proposed approach to distill the intermediate features of multiple powerful but cumbersome teachers to further improve the performance of the lightweight student network. Extensive experiments on the publicly available CIFAR-100 image classification benchmark with various teacher-student network pairs demonstrate the effectiveness and flexibility of our approach. For instance, the VGG8|ShuffleNetV2 student trained with DE-MKD reached 75.25%|78.86% top-1 accuracy with VGG13|WRN40-2 as the teacher, setting new performance records. In addition, surprisingly, the distilled student model outperformed its teacher in both teacher-student network pairs.
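The core mechanism described in the abstract (per-teacher weights derived from the entropy of each teacher's predictions, used to combine per-teacher distillation losses) can be illustrated as follows. This is a minimal sketch only: the temperature value, the softmax-over-negative-entropies weighting rule, and all function names are assumptions for illustration, not the exact formulation or decoupling scheme of DE-MKD.

```python
# Illustrative sketch of entropy-weighted multi-teacher distillation (assumed scheme).
import torch
import torch.nn.functional as F

def entropy_weights(teacher_logits_list, T=4.0):
    """One weight per teacher, derived from the entropy of its softened predictions.

    teacher_logits_list: list of tensors, each of shape (batch, num_classes).
    Lower average entropy (a more confident teacher) receives a larger weight.
    """
    entropies = []
    for logits in teacher_logits_list:
        p = F.softmax(logits / T, dim=1)                     # softened class probabilities
        h = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=1)  # per-sample entropy
        entropies.append(h.mean())                           # average over the batch
    entropies = torch.stack(entropies)
    return F.softmax(-entropies, dim=0)                      # weights sum to 1

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T=4.0):
    """Entropy-weighted sum of per-teacher KL distillation losses."""
    weights = entropy_weights(teacher_logits_list, T)
    loss = student_logits.new_zeros(())
    for w, t_logits in zip(weights, teacher_logits_list):
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(t_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)                                          # standard T^2 scaling of the KD term
        loss = loss + w * kd
    return loss
```

In this sketch, a teacher whose softened predictions have lower entropy contributes more to the combined distillation loss; the weighted KL terms are then summed to train the student.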
Pages: 10
Related Papers
35 records in total
  • [1] Knowledge Distillation with the Reused Teacher Classifier
    Chen, Defang
    Mei, Jian-Ping
    Zhang, Hailin
    Wang, Can
    Feng, Yan
    Chen, Chun
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 11923 - 11932
  • [2] Du Shangchen, 2020, Advances in Neural Information Processing Systems, V33, P12345
  • [3] Efficient Knowledge Distillation from an Ensemble of Teachers
    Fukuda, Takashi
    Suzuki, Masayuki
    Kurata, Gakuto
    Thomas, Samuel
    Cui, Jia
    Ramabhadran, Bhuvana
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3697 - 3701
  • [4] Knowledge Distillation: A Survey
    Gou, Jianping
    Yu, Baosheng
    Maybank, Stephen J.
    Tao, Dacheng
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2021, 129 (06) : 1789 - 1819
  • [5] He KM, 2020, IEEE T PATTERN ANAL, V42, P386, DOI [10.1109/TPAMI.2018.2844175, 10.1109/ICCV.2017.322]
  • [6] Deep Residual Learning for Image Recognition
    He, Kaiming
    Zhang, Xiangyu
    Ren, Shaoqing
    Sun, Jian
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 770 - 778
  • [7] Hinton G, 2015, Arxiv, DOI [arXiv:1503.02531, DOI 10.48550/ARXIV.1503.02531]
  • [8] A fast learning algorithm for deep belief nets
    Hinton, Geoffrey E.
    Osindero, Simon
    Teh, Yee-Whye
    [J]. NEURAL COMPUTATION, 2006, 18 (07) : 1527 - 1554
  • [9] Hu J, 2018, PROC CVPR IEEE, P7132, DOI [10.1109/TPAMI.2019.2913372, 10.1109/CVPR.2018.00745]
  • [10] Krizhevsky Alex, 2009, Learning Multiple Layers of Features from Tiny Images