Improving Knowledge Distillation via Regularizing Feature Direction and Norm

Times Cited: 0
Authors
Wang, Yuzhu [1 ]
Cheng, Lechao [2 ]
Duan, Manni [1 ]
Wang, Yongheng [1 ]
Feng, Zunlei [3 ]
Kong, Shu [4 ,5 ,6 ]
Affiliations
[1] Zhejiang Lab, Hangzhou, Peoples R China
[2] Hefei Univ Technol, Hefei, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Univ Macau, Taipa, Macao, Peoples R China
[5] Inst Collaborat Innovat, Taipa, Macao, Peoples R China
[6] Texas A&M Univ, College Stn, TX USA
Source
COMPUTER VISION - ECCV 2024, PT XXIV | 2025, Vol. 15082
Funding
National Natural Science Foundation of China
Keywords
knowledge distillation; large-norm; feature direction;
DOI
10.1007/978-3-031-72691-0_2
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation (KD) is a model-compression technique that exploits a large, well-trained teacher neural network to train a small student network. Treating the teacher's features as knowledge, prevailing methods train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence or L2 distance between their (logit) features. While it is natural to assume that better feature alignment helps distill the teacher's knowledge, simply forcing this alignment does not directly contribute to the student's performance, e.g., classification accuracy. For example, minimizing the L2 distance between penultimate-layer features (those used to compute logits for classification) does not necessarily help learn a better student classifier. This motivates us to regularize the student's penultimate-layer features using the teacher, towards training a better student classifier. Specifically, we present a rather simple method that uses the teacher's class-mean features to align the direction of the student's features. Experiments show that this significantly improves KD performance. Moreover, we empirically find that the student produces features with notably smaller norms than the teacher's, which motivates us to regularize the student to produce large-norm features. Experiments show that doing so also yields better performance. Finally, as our main technical contribution, we present a simple loss that regularizes the student by simultaneously (1) aligning the direction of its features with the teacher's class-mean features, and (2) encouraging it to produce large-norm features. Experiments on standard benchmarks demonstrate that adopting our technique remarkably improves existing KD methods, achieving state-of-the-art KD performance for image classification (on ImageNet and CIFAR-100) and object detection (on COCO).
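For intuition, below is a minimal PyTorch sketch of how such a regularizer could look. The abstract specifies only the two ingredients, namely direction alignment with the teacher's class-mean feature and a preference for large-norm student features; the function name direction_norm_kd_loss, the weights alpha and beta, and the hinge to a fixed target_norm are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def direction_norm_kd_loss(student_feat, teacher_class_means, labels,
                           alpha=1.0, beta=1.0, target_norm=10.0):
    """Sketch of a direction + norm regularizer on penultimate-layer features.

    student_feat:        (B, D) penultimate-layer student features
    teacher_class_means: (C, D) per-class mean features precomputed from the frozen teacher
    labels:              (B,) ground-truth class indices
    alpha, beta:         illustrative weights for the two terms (assumed)
    target_norm:         assumed target for the large-norm term (assumed)
    """
    # (1) Direction term: pull each student feature toward the direction of the
    #     teacher's class-mean feature for its ground-truth class.
    target_means = teacher_class_means[labels]                      # (B, D)
    cos_sim = F.cosine_similarity(student_feat, target_means, dim=1)
    direction_loss = (1.0 - cos_sim).mean()

    # (2) Norm term: encourage large-norm student features. The hinge to a fixed
    #     target_norm is an assumed, bounded surrogate; the paper's exact form may differ.
    norms = student_feat.norm(p=2, dim=1)
    norm_loss = F.relu(target_norm - norms).mean()

    return alpha * direction_loss + beta * norm_loss


# Toy usage with random tensors (C classes, D-dimensional features).
if __name__ == "__main__":
    B, C, D = 8, 100, 512
    student_feat = torch.randn(B, D, requires_grad=True)
    teacher_class_means = torch.randn(C, D)   # in practice, averaged teacher features per class
    labels = torch.randint(0, C, (B,))
    loss = direction_norm_kd_loss(student_feat, teacher_class_means, labels)
    loss.backward()
    print(loss.item())
```

In practice such a term would be added on top of an existing KD objective (e.g., a logit-based KD loss) and the cross-entropy loss; the weighting between them is a design choice not specified in the abstract.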
Pages: 20-37
Number of Pages: 18