Improved deep learning image classification algorithm based on Swin Transformer V2

Cited by: 0
Authors
Wei J. [1 ]
Chen J. [1 ]
Wang Y. [2 ]
Luo H. [1 ]
Li W. [1 ]
Affiliations
[1] College of Information Engineering, Sichuan Agricultural University, Ya'an, Sichuan
[2] College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Ya'an, Sichuan
Keywords
Attention mechanism; Convolutional neural networks; Image classification; Transformer
DOI
10.7717/peerj-cs.1665
Abstract
While convolutional operations effectively extract local features, their limited receptive fields make it difficult to capture global dependencies. Transformers, on the other hand, excel at global modeling and effectively capture long-range dependencies. However, the self-attention mechanism used in Transformers lacks a local mechanism for exchanging information within specific regions. This article leverages the strengths of both Transformers and convolutional neural networks (CNNs) to enhance the Swin Transformer V2 model. By incorporating both convolutional operations and self-attention, the enhanced model combines the local information-capturing capability of CNNs with the long-range dependency-capturing ability of Transformers. The improved model strengthens local feature extraction through the introduction of a Swin Transformer Stem, an inverted residual feed-forward network, and a Dual-Branch Downsampling structure, and then models global dependencies with an improved self-attention mechanism. Additionally, downsampling is applied to the attention mechanism's Q and K to reduce computational and memory overhead. Under identical training conditions, the proposed method significantly improves classification accuracy on multiple image classification datasets and shows more robust generalization. © Copyright 2023 Wei et al.
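The abstract's efficiency idea is to shrink the attention computation by reducing the number of tokens that enter the attention score matrix. The PyTorch sketch below illustrates that general idea, not the paper's exact design: it downsamples the key/value path with a strided convolution (the module layout, head count, reduction operator, and the choice to reduce keys/values rather than the precise Q/K scheme described in the abstract are assumptions for illustration).

```python
import torch
import torch.nn as nn


class DownsampledSelfAttention(nn.Module):
    """Illustrative multi-head self-attention with a spatially reduced
    key/value sequence. This is a generic sketch of token-reduction
    attention, not the authors' exact implementation."""

    def __init__(self, dim: int, num_heads: int = 4, reduction: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Strided convolution as the downsampling operator (assumption).
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w tokens.
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Downsample the token grid before computing keys and values.
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = self.reduce(feat).flatten(2).transpose(1, 2)  # (B, M, C), M < N
        feat = self.norm(feat)
        kv = self.kv(feat).reshape(b, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                     # each: (B, heads, M, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, heads, N, M)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(1, 14 * 14, 96)           # 14x14 token grid, 96 channels
    attn = DownsampledSelfAttention(dim=96)
    print(attn(x, h=14, w=14).shape)           # torch.Size([1, 196, 96])
```

With a reduction factor r, the score matrix in this sketch shrinks from N×N to N×(N/r²), which is where the computational and memory savings come from.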