Why does Knowledge Distillation work? Rethink its attention and fidelity mechanism

Cited by: 0
Authors
Guo, Chenqi [1 ]
Zhong, Shiwei [1 ]
Liu, Xiaofeng [1 ]
Feng, Qianli [2 ]
Ma, Yinglong [1 ]
Affiliations
[1] North China Electric Power University, School of Control and Computer Engineering, 2 Beinong Rd, Beijing 102206, China
[2] Amazon, 300 Boren Ave N, Seattle, WA 98109 USA
Keywords
Knowledge distillation; Ensemble learning; Attention mechanism; Supervised image classification; Data augmentation
DOI
10.1016/j.eswa.2024.125579
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Does Knowledge Distillation (KD) really work? Conventional wisdom views it as a knowledge-transfer procedure in which the student is expected to mimic its teacher as closely as possible. However, recent studies paradoxically indicate that closely replicating the teacher's behavior does not consistently improve student generalization, raising questions about the possible causes. Confronted with this gap, we hypothesize that, in ensemble KD setups, diverse teacher attentions contribute to better student generalization at the expense of reduced fidelity. Focusing on the supervised image classification task, we find that increasing the strength of data augmentation decreases the Intersection over Union (IoU) of attention maps across teacher models, which in turn reduces student overfitting and lowers fidelity. We argue that this low-fidelity phenomenon is an inherent characteristic of KD training rather than a pathology. This suggests that stronger data augmentation yields a broader perspective from the divergent teacher ensemble and lower student-teacher mutual information, both of which benefit generalization. We further demonstrate that even explicitly optimizing for logit matching between teachers and student can hardly mitigate this low-fidelity effect. These insights clarify the mechanism behind the low-fidelity phenomenon in KD and offer new perspectives on improving student performance: increase diversity in teacher attentions and reduce mimicry between teachers and student. Code is available at https://github.com/zisci2/RethinkKD
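The abstract's central quantity is the IoU of attention maps across teacher models under varying augmentation strength. The paper's exact extraction procedure is not given in this record, so the sketch below only illustrates one plausible way to compute a pairwise attention IoU in PyTorch: it treats the channel-mean of a late feature map as a spatial attention map, binarizes it at a quantile, and measures the overlap between two teachers. The function names, the quantile threshold, and the attention definition are illustrative assumptions, not the authors' implementation (see the official code at https://github.com/zisci2/RethinkKD).

```python
# Minimal sketch (assumptions labeled above, not from the paper's repo):
# pairwise attention IoU between two teacher models.
import torch


def attention_map(features: torch.Tensor) -> torch.Tensor:
    """Reduce a feature map (B, C, H, W) to a normalized spatial attention map (B, H, W)."""
    attn = features.abs().mean(dim=1)                                # channel-mean saliency
    attn = attn / attn.amax(dim=(1, 2), keepdim=True).clamp_min(1e-8)  # scale each map to [0, 1]
    return attn


def attention_iou(feat_a: torch.Tensor, feat_b: torch.Tensor, q: float = 0.8) -> torch.Tensor:
    """IoU of the top-(1 - q) attended regions of two teachers, averaged over the batch."""
    a, b = attention_map(feat_a), attention_map(feat_b)
    # Binarize each map at its own q-quantile so both masks cover a comparable area.
    thr_a = torch.quantile(a.flatten(1), q, dim=1).view(-1, 1, 1)
    thr_b = torch.quantile(b.flatten(1), q, dim=1).view(-1, 1, 1)
    mask_a, mask_b = a >= thr_a, b >= thr_b
    inter = (mask_a & mask_b).flatten(1).sum(dim=1).float()
    union = (mask_a | mask_b).flatten(1).sum(dim=1).float().clamp_min(1.0)
    return (inter / union).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-ins for last-stage backbone features hooked from two teachers on one augmented batch.
    feats_t1 = torch.randn(8, 256, 7, 7)
    feats_t2 = torch.randn(8, 256, 7, 7)
    # Lower IoU would indicate more divergent teacher attentions under the abstract's hypothesis.
    print(f"pairwise attention IoU: {attention_iou(feats_t1, feats_t2).item():.3f}")
```

Under this reading, sweeping augmentation strength (e.g., RandAugment magnitude) and tracking the average pairwise IoU across the teacher ensemble would reproduce the trend the abstract describes: stronger augmentation, lower attention IoU, lower student-teacher fidelity.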
Pages: 14