Improving Knowledge Distillation via Regularizing Feature Direction and Norm

Times Cited: 0
Authors
Wang, Yuzhu [1 ]
Cheng, Lechao [2 ]
Duan, Manni [1 ]
Wang, Yongheng [1 ]
Feng, Zunlei [3 ]
Kong, Shu [4 ,5 ,6 ]
Affiliations
[1] Zhejiang Lab, Hangzhou, Peoples R China
[2] Hefei Univ Technol, Hefei, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Univ Macau, Taipa, Macao, Peoples R China
[5] Inst Collaborat Innovat, Taipa, Macao, Peoples R China
[6] Texas A&M Univ, College Stn, TX USA
Source
COMPUTER VISION - ECCV 2024, PT XXIV | 2025 / Vol. 15082
Funding
National Natural Science Foundation of China;
Keywords
knowledge distillation; large-norm; feature direction;
DOI
10.1007/978-3-031-72691-0_2
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Knowledge distillation (KD) is a model-compression technique that exploits a large, well-trained teacher neural network to train a small student network. Treating the teacher's features as knowledge, prevailing methods train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence or the L2 distance between their (logit) features. While it is natural to assume that better feature alignment helps distill the teacher's knowledge, simply forcing this alignment does not directly improve the student's performance, e.g., classification accuracy. For example, minimizing the L2 distance between the penultimate-layer features (used to compute logits for classification) does not necessarily help learn a better student classifier. We are therefore motivated to regularize the student's penultimate-layer features using the teacher, toward training a better student classifier. Specifically, we present a rather simple method that uses the teacher's class-mean features to align student features with respect to their direction. Experiments show that this significantly improves KD performance. Moreover, we empirically find that the student produces features with notably smaller norms than the teacher's, motivating us to regularize the student to produce large-norm features. Experiments show that doing so also yields better performance. Finally, as our main technical contribution, we present a simple loss that regularizes the student by simultaneously (1) aligning the direction of its features with the teacher's class-mean features and (2) encouraging it to produce large-norm features. Experiments on standard benchmarks demonstrate that adopting our technique remarkably improves existing KD methods, achieving state-of-the-art KD performance in image classification (on the ImageNet and CIFAR-100 datasets) and object detection (on the COCO dataset).
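For concreteness, the following is a minimal PyTorch-style sketch of a regularizer matching the abstract's description; it is not the authors' released implementation. The function name direction_norm_kd_loss, the weights lambda_dir and lambda_norm, and the choice of a hinge against the norm of the matched teacher class-mean as the "large-norm" target are illustrative assumptions; in practice such a term would be added to the usual cross-entropy and KD objectives.

```python
import torch
import torch.nn.functional as F


def direction_norm_kd_loss(student_feat, teacher_class_means, labels,
                           lambda_dir=1.0, lambda_norm=1.0):
    """Sketch of a regularizer that (1) aligns the direction of student
    penultimate-layer features with the teacher's class-mean features and
    (2) encourages the student to produce large-norm features.

    student_feat:        (B, D) student penultimate-layer features
    teacher_class_means: (C, D) per-class mean features precomputed with the
                         frozen teacher on the training set
    labels:              (B,)   ground-truth class indices
    """
    # (1) Direction term: maximize cosine similarity between each student
    # feature and the teacher class-mean of its ground-truth class.
    target_means = teacher_class_means[labels]                   # (B, D)
    cos_sim = F.cosine_similarity(student_feat, target_means, dim=1)
    loss_dir = (1.0 - cos_sim).mean()

    # (2) Norm term (an assumed form): hinge penalizing student feature norms
    # that fall below the norm of the matched teacher class-mean feature.
    student_norm = student_feat.norm(p=2, dim=1)
    target_norm = target_means.norm(p=2, dim=1).detach()
    loss_norm = F.relu(target_norm - student_norm).mean()

    return lambda_dir * loss_dir + lambda_norm * loss_norm


if __name__ == "__main__":
    # Toy shapes for a smoke test; real use would pass backbone features.
    B, C, D = 4, 10, 64
    f_s = torch.randn(B, D)
    class_means = torch.randn(C, D)
    y = torch.randint(0, C, (B,))
    print(direction_norm_kd_loss(f_s, class_means, y).item())
```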
Pages: 20-37
Page Count: 18