Why does Knowledge Distillation work? Rethink its attention and fidelity mechanism

Cited: 2
Authors
Guo, Chenqi [1 ]
Zhong, Shiwei [1 ]
Liu, Xiaofeng [1 ]
Feng, Qianli [2 ]
Ma, Yinglong [1 ]
Affiliations
[1] North China Elect Power Univ, Control & Comp Engn, 2 Beinong Rd, Beijing 102206, Peoples R China
[2] Amazon, 300 Boren Ave N, Seattle, WA 98109 USA
Keywords
Knowledge distillation; Ensemble learning; Attention mechanism; Supervised image classification; Data augmentation
DOI
10.1016/j.eswa.2024.125579
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Does Knowledge Distillation (KD) really work? Conventional wisdom views it as a knowledge transfer procedure in which a perfect mimicry of the teacher by the student is desired. However, paradoxical studies indicate that closely replicating the teacher's behavior does not consistently improve student generalization, raising questions about its possible causes. Confronted with this gap, we hypothesize that diverse attentions in teachers contribute to better student generalization at the expense of reduced fidelity in ensemble KD setups. Focusing on the supervised image classification task, by increasing data augmentation strengths, our key findings reveal a decrease in the Intersection over Union (IoU) of attentions between teacher models, leading to reduced student overfitting and decreased fidelity. We propose that this low-fidelity phenomenon is an underlying characteristic rather than a pathology when training with KD. This suggests that stronger data augmentation fosters a broader perspective provided by the divergent teacher ensemble and lower student-teacher mutual information, benefiting generalization performance. We further demonstrate that even optimization towards logits-matching between teachers and student can hardly mitigate this low-fidelity effect. These insights clarify the mechanism behind the low-fidelity phenomenon in KD. Thus, we offer new perspectives on optimizing student model performance, by emphasizing increased diversity in teacher attentions and reduced mimicry behavior between teachers and student. Code is available at https://github.com/zisci2/RethinkKD
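The abstract's central measurement is the IoU between teachers' attention regions. A minimal sketch of how such an overlap score could be computed is below; the function name `attention_iou`, the quantile-based binarization, and the toy 7x7 maps are illustrative assumptions for this record, not the authors' implementation (which is in the linked repository).

```python
import numpy as np

def attention_iou(attn_a, attn_b, quantile=0.9):
    """Intersection-over-Union of two attention maps, each binarized by
    keeping only the cells at or above its own given quantile.
    Lower IoU = the two teachers attend to more divergent regions."""
    mask_a = attn_a >= np.quantile(attn_a, quantile)
    mask_b = attn_b >= np.quantile(attn_b, quantile)
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 1.0

# Two toy teachers attending to partially overlapping 3x3 regions
# of a 7x7 feature map (9 attended cells each, 4 shared).
a = np.zeros((7, 7)); a[2:5, 2:5] = 1.0
b = np.zeros((7, 7)); b[3:6, 3:6] = 1.0
print(attention_iou(a, b))  # → 0.2857142857142857 (= 4/14)
```

Under the paper's hypothesis, averaging this score over teacher pairs would decrease as augmentation strength grows, tracking the reported drop in student-teacher fidelity.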
Pages: 14