SAKD: Sparse attention knowledge distillation

Cited by: 3
Authors
Guo, Zhen [1 ,2 ]
Zhang, Pengzhou [1 ]
Liang, Peng [2 ]
Affiliations
[1] Commun Univ China, State Key Lab Media Convergence & Commun, Dingfuzhuang East St 1, Beijing 100024, Peoples R China
[2] China Unicom Smart City Res Inst, Shoutinanlu 9, Beijing 100024, Peoples R China
Keywords
Knowledge distillation; Attention mechanisms; Sparse attention mechanisms
DOI
10.1016/j.imavis.2024.105020
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Deep learning techniques have attracted significant interest due to their success at large model scale. However, large models often require massive computational resources, which poses a challenge for end devices with limited storage and compute. Transferring knowledge from large models to small ones, so that similar results can be achieved with limited resources, remains an open research problem. Knowledge distillation, which uses a teacher-student framework to migrate the capabilities of a large model into a small one, has been widely applied to model compression and knowledge transfer. In this paper, a novel knowledge distillation approach based on a sparse attention mechanism (SAKD) is proposed. SAKD computes attention using student features as queries and teacher features as keys and values, and sparsifies the attention values through random deactivation. The sparse attention values are then used to reweight the feature distance of each teacher-student feature pair, which helps avoid negative transfer. Comprehensive experiments demonstrate the effectiveness and generality of our approach. Moreover, SAKD outperforms previous state-of-the-art methods on image classification tasks.
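The attention-weighted distillation loss described in the abstract can be sketched in a few lines of PyTorch. The sketch below illustrates the idea only and is not the authors' implementation: the projection layers, embedding dimension, dropout rate, and the Euclidean distance term are assumptions made for the example.

```python
# Minimal sketch of a sparse-attention-weighted feature distillation loss,
# following the abstract's description: student features act as queries,
# teacher features as keys/values, attention is sparsified by random
# deactivation (dropout), and the sparse attention reweights pairwise
# teacher-student feature distances. All layer choices are assumptions.
import torch
import torch.nn as nn

class SAKDLoss(nn.Module):
    def __init__(self, student_dim, teacher_dim, embed_dim=128, drop_p=0.5):
        super().__init__()
        self.q_proj = nn.Linear(student_dim, embed_dim)  # student -> query space
        self.k_proj = nn.Linear(teacher_dim, embed_dim)  # teacher -> key space
        self.drop = nn.Dropout(drop_p)                   # random deactivation -> sparse attention
        self.scale = embed_dim ** -0.5

    def forward(self, student_feats, teacher_feats):
        # student_feats: list of S tensors, each [B, student_dim]
        # teacher_feats: list of T tensors, each [B, teacher_dim]
        s = torch.stack(student_feats, dim=1)            # [B, S, student_dim]
        t = torch.stack(teacher_feats, dim=1)            # [B, T, teacher_dim]
        q = self.q_proj(s)                               # [B, S, D]
        k = self.k_proj(t)                               # [B, T, D]

        # Attention of each student feature over all teacher features.
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # [B, S, T]
        attn = self.drop(attn)  # zero out random pairs so weak matches do not transfer

        # Pairwise distances between projected student/teacher features.
        dist = torch.cdist(q, k)                         # [B, S, T]

        # Reweight each pair's distance by its (sparse) attention value.
        return (attn * dist).mean()
```

In training, a loss of this kind would be added to the student's usual task loss; because of the dropout-induced sparsity, only a random subset of teacher-student feature pairs contributes to the transfer in each iteration.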
Pages: 8