SAKD: Sparse attention knowledge distillation

Cited by: 3
Authors
Guo, Zhen [1 ,2 ]
Zhang, Pengzhou [1 ]
Liang, Peng [2 ]
Affiliations
[1] Commun Univ China, State Key Lab Media Convergence & Commun, Dingfuzhuang East St 1, Beijing 100024, Peoples R China
[2] China Unicom Smart City Res Inst, Shoutinanlu 9, Beijing 100024, Peoples R China
Keywords
Knowledge distillation; Attention mechanisms; Sparse attention mechanisms;
DOI
10.1016/j.imavis.2024.105020
CLC classification code
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Deep learning techniques have attracted significant interest owing to their success in large-model scenarios. However, large models typically require massive computational resources, which challenges end devices with limited storage and compute. Transferring knowledge from large models to small ones so that comparable results can be achieved with limited resources remains an open research problem. Knowledge distillation, which uses a teacher-student framework to transfer the capabilities of a large model to a small one, has been widely applied to model compression and knowledge transfer. This paper proposes a novel knowledge distillation approach based on a sparse attention mechanism (SAKD). SAKD computes attention using student features as queries and teacher features as keys and values, and sparsifies the attention values via random deactivation. The sparse attention values are then used to reweight the feature distance of each teacher-student feature pair, thereby avoiding negative transfer. Comprehensive experiments demonstrate the effectiveness and generality of the approach, and SAKD outperforms previous state-of-the-art methods on image classification tasks.
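The abstract describes the mechanism only at a high level. The following PyTorch sketch illustrates one plausible reading of it: student features act as queries, teacher features as keys and values, the attention map is sparsified by random deactivation (dropout), and the resulting weights rescale per-pair feature distances. The class name, projection dimensions, dropout rate, and the L2 distance are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the SAKD idea from the abstract (layer names, dims,
# and the exact distance are assumptions, not the paper's official code).
import torch
import torch.nn as nn


class SAKDLoss(nn.Module):
    """Reweights teacher-student feature distances with a sparsified
    student-query / teacher-key attention map."""

    def __init__(self, s_dim, t_dim, embed_dim=128, drop_p=0.5):
        super().__init__()
        self.q_proj = nn.Linear(s_dim, embed_dim)  # student features -> queries
        self.k_proj = nn.Linear(t_dim, embed_dim)  # teacher features -> keys
        self.v_proj = nn.Linear(t_dim, embed_dim)  # teacher features -> values
        self.drop = nn.Dropout(drop_p)             # random deactivation -> sparse attention
        self.scale = embed_dim ** -0.5

    def forward(self, feats_s, feats_t):
        # feats_s: list of pooled student features, each of shape (B, s_dim)
        # feats_t: list of pooled teacher features, each of shape (B, t_dim)
        q = torch.stack([self.q_proj(f) for f in feats_s], dim=1)  # (B, Ns, D)
        k = torch.stack([self.k_proj(f) for f in feats_t], dim=1)  # (B, Nt, D)
        v = torch.stack([self.v_proj(f) for f in feats_t], dim=1)  # (B, Nt, D)

        # attention of each student feature over all teacher features
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, Ns, Nt)
        attn = self.drop(attn)  # zero out some pairs so weak matches contribute nothing

        # pairwise L2 distances between projected student and teacher features
        dist = torch.cdist(q, v, p=2)   # (B, Ns, Nt)
        return (attn * dist).mean()     # attention-reweighted distillation loss
```

In a hypothetical training loop this term would simply be added to the task loss, e.g. `loss = ce_loss + lambda_kd * sakd_loss(feats_s, feats_t)`, where `lambda_kd` is an assumed balancing weight.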
Pages: 8
Related papers
50 records in total
  • [21] Knowledge distillation with category-aware attention and discriminant logit losses
    Jiang, Lei
    Zhou, Wengang
    Li, Houqiang
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1792 - 1797
  • [22] Application of sparse S transform network with knowledge distillation in seismic attenuation delineation
    Liu, Nai-Hao
    Zhang, Yu-Xin
    Yang, Yang
    Liu, Rong-Chang
    Gao, Jing-Huai
    Zhang, Nan
    PETROLEUM SCIENCE, 2024, 21 (04) : 2345 - 2355
  • [23] Research on knowledge distillation algorithm based on Yolov5 attention mechanism
    Cheng, Shengjie
    Zhou, Peiyong
    Liu, Yu
    Ma, Hongji
    Aysa, Alimjan
    Ubul, Kurban
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 240
  • [24] Alignment Knowledge Distillation for Online Streaming Attention-Based Speech Recognition
    Inaguma, Hirofumi
    Kawahara, Tatsuya
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 1371 - 1385
  • [25] Why does Knowledge Distillation work? Rethink its attention and fidelity mechanism
    Guo, Chenqi
    Zhong, Shiwei
    Liu, Xiaofeng
    Feng, Qianli
    Ma, Yinglong
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 262
  • [26] SiamPKHT: Hyperspectral Siamese Tracking Based on Pyramid Shuffle Attention and Knowledge Distillation
    Qian, Kun
    Wang, Shiqing
    Zhang, Shoujin
    Shen, Jianlu
    SENSORS, 2023, 23 (23)
  • [27] Photovoltaic hot spot detection method incorporating knowledge distillation and attention mechanisms
    Hao S.
    Wu Y.
    Ma X.
    Li T.
    Wang H.
    Guangxue Jingmi Gongcheng/Optics and Precision Engineering, 2023, 31 (24) : 3640 - 3650
  • [28] Tea Buds Grading Method Based on Multiscale Attention Mechanism and Knowledge Distillation
    Huang H.
    Chen X.
    Han Z.
    Fan Q.
    Zhu Y.
    Hu P.
    Nongye Jixie Xuebao/Transactions of the Chinese Society for Agricultural Machinery, 2022, 53 (09) : 399 - 407, 458
  • [29] Improving adversarial robustness using knowledge distillation guided by attention information bottleneck
    Gong, Yuxin
    Wang, Shen
    Yu, Tingyue
    Jiang, Xunzhi
    Sun, Fanghui
    INFORMATION SCIENCES, 2024, 665
  • [30] Multiscale knowledge distillation with attention based fusion for robust human activity recognition
    Yuan, Zhaohui
    Yang, Zhengzhe
    Ning, Hao
    Tang, Xiangyang
    SCIENTIFIC REPORTS, 2024, 14 (01)