SAKD: Sparse attention knowledge distillation

Cited by: 4
Authors
Guo, Zhen [1 ,2 ]
Zhang, Pengzhou [1 ]
Liang, Peng [2 ]
Affiliations
[1] Commun Univ China, State Key Lab Media Convergence & Commun, Dingfuzhuang East St 1, Beijing 100024, Peoples R China
[2] China Unicom Smart City Res Inst, Shoutinanlu 9, Beijing 100024, Peoples R China
Keywords
Knowledge distillation; Attention mechanisms; Sparse attention mechanisms;
DOI
10.1016/j.imavis.2024.105020
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Deep learning techniques have attracted significant interest due to their success at large model scale. However, large models often require massive computational resources, which challenges end devices with limited storage and compute. Transferring knowledge from large models to small ones, so that similar results can be achieved with limited resources, requires further research. Knowledge distillation, which uses a teacher-student framework to migrate the capabilities of a large model into a small one, has been widely applied to model compression and knowledge transfer. In this paper, a novel knowledge distillation approach based on a sparse attention mechanism (SAKD) is proposed. SAKD computes attention using student features as queries and teacher features as keys and values, and sparsifies the attention values by random deactivation. The sparse attention values are then used to reweight the feature distance of each teacher-student feature pair, avoiding negative transfer. Comprehensive experiments demonstrate the effectiveness and generality of our approach. Moreover, SAKD outperforms previous state-of-the-art methods on image classification tasks.
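The loss described in the abstract can be sketched as follows. This is a minimal, hypothetical reconstruction based only on the abstract's description (the paper's exact formulation, layer choices, and hyperparameters are not given here): student features act as queries, teacher features as keys, the attention map is sparsified by dropout-style random deactivation, and the surviving weights rescale each teacher-student pairwise feature distance. The function name `sakd_loss` and the `keep_prob` parameter are illustrative assumptions, not the authors' API.

```python
import numpy as np

def sakd_loss(student_feats, teacher_feats, keep_prob=0.7, rng=None):
    """Hypothetical sketch of an SAKD-style reweighted distillation loss.

    student_feats: (n, d) student-layer features, used as queries.
    teacher_feats: (m, d) teacher-layer features, used as keys/values.
    keep_prob: probability of keeping each attention weight
               (random deactivation; assumed, not from the paper).
    """
    rng = rng or np.random.default_rng(0)
    n, d = student_feats.shape
    # Scaled dot-product attention between student queries and teacher keys.
    scores = student_feats @ teacher_feats.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    # Sparsify by random deactivation, rescaled so the expected
    # attention mass is unchanged (standard inverted-dropout scaling).
    mask = rng.random(attn.shape) < keep_prob
    sparse_attn = attn * mask / keep_prob
    # Reweight the squared distance of every teacher-student feature pair;
    # deactivated pairs contribute nothing, limiting negative transfer.
    dists = ((student_feats[:, None, :] - teacher_feats[None, :, :]) ** 2).sum(-1)
    return float((sparse_attn * dists).sum() / n)
```

In this reading, low-attention or randomly deactivated teacher-student pairs are effectively excluded from the distance term, which is one plausible way the reweighting could suppress negative transfer from mismatched feature pairs.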
Pages: 8