Improving Knowledge Distillation via Regularizing Feature Direction and Norm

Times Cited: 0
Authors
Wang, Yuzhu [1 ]
Cheng, Lechao [2 ]
Duan, Manni [1 ]
Wang, Yongheng [1 ]
Feng, Zunlei [3 ]
Kong, Shu [4 ,5 ,6 ]
Affiliations
[1] Zhejiang Lab, Hangzhou, Peoples R China
[2] Hefei Univ Technol, Hefei, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Univ Macau, Taipa, Macao, Peoples R China
[5] Inst Collaborat Innovat, Taipa, Macao, Peoples R China
[6] Texas A&M Univ, College Stn, TX USA
Source
COMPUTER VISION - ECCV 2024, PT XXIV | 2025 / Vol. 15082
Funding
National Natural Science Foundation of China;
Keywords
knowledge distillation; large-norm; feature direction;
DOI
10.1007/978-3-031-72691-0_2
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Knowledge distillation (KD) is a model-compression technique that exploits a large, well-trained teacher neural network to train a small student network. Treating the teacher's features as knowledge, prevailing methods train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence or L2 distance between their (logit) features. While it is natural to assume that better feature alignment helps distill the teacher's knowledge, simply forcing this alignment does not directly contribute to the student's performance, e.g., classification accuracy. For example, minimizing the L2 distance between penultimate-layer features (used to compute logits for classification) does not necessarily help learn a better student classifier. This motivates us to use the teacher to regularize the student's penultimate-layer features towards training a better student classifier. Specifically, we present a rather simple method that uses the teacher's class-mean features to align the direction of the student's features. Experiments show that this significantly improves KD performance. Moreover, we empirically find that the student produces features with notably smaller norms than the teacher's, motivating us to regularize the student to produce large-norm features. Experiments show that doing so also yields better performance. Finally, as our main technical contribution, we present a simple loss that regularizes the student by simultaneously (1) aligning the direction of its features with the teacher's class-mean features, and (2) encouraging it to produce large-norm features. Experiments on standard benchmarks demonstrate that adopting our technique remarkably improves existing KD methods, achieving state-of-the-art KD performance on image classification (ImageNet and CIFAR100) and object detection (COCO).
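The loss described in the abstract combines two regularizers on the student's penultimate-layer features: direction alignment with the teacher's class-mean feature, and a term favoring large feature norms. A minimal NumPy sketch is given below; it is an illustration under assumptions, not the paper's exact formulation, and the function name, the cosine-based direction term, and the negative-mean-norm term are all hypothetical choices for concreteness.

```python
import numpy as np

def direction_norm_loss(student_feats, teacher_class_means, labels, lambda_norm=1.0):
    """Hypothetical sketch of a direction-and-norm regularizer:
    (1) align each student feature's direction with the teacher's
        class-mean feature for its label (via cosine similarity), and
    (2) encourage large student feature norms (via a negative-norm term).
    student_feats: (N, D) student penultimate-layer features.
    teacher_class_means: (C, D) teacher per-class mean features.
    labels: (N,) integer class labels."""
    # Look up the teacher class-mean feature for each sample's label.
    targets = teacher_class_means[labels]                      # (N, D)
    # Direction term: 1 - cosine similarity, averaged over the batch.
    cos = np.sum(student_feats * targets, axis=1) / (
        np.linalg.norm(student_feats, axis=1)
        * np.linalg.norm(targets, axis=1) + 1e-8)
    direction_loss = np.mean(1.0 - cos)
    # Norm term: minimizing the negative mean norm pushes norms up.
    norm_loss = -np.mean(np.linalg.norm(student_feats, axis=1))
    return direction_loss + lambda_norm * norm_loss
```

In practice such a term would be added to the usual task loss (e.g., cross-entropy) and an existing KD objective; with `lambda_norm=0` only the direction-alignment term remains.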
Pages: 20-37
Page count: 18