Vision Transformer Based on Reconfigurable Gaussian Self-attention

Cited by: 0
Authors
Zhao L. [1 ,2 ]
Zhou J.-K. [1 ]
Affiliations
[1] College of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an
[2] Shaanxi Provincial Key Laboratory of Geotechnical and Underground Space Engineering, Xi'an
Source
Zidonghua Xuebao/Acta Automatica Sinica | 2023, Vol. 49, No. 09
Funding
National Natural Science Foundation of China;
Keywords
Gaussian weight recombination (GWR); image classification; local self-attention; object detection; Transformer;
DOI
10.16383/j.aas.c220715
Abstract
In current vision Transformers, existing local self-attention strategies cannot establish information flow between all windows, which limits their context modeling ability. To address this problem, this paper proposes a new local self-attention mechanism, shuffled and Gaussian window multi-head self-attention (SGW-MSA), based on a Gaussian weight recombination (GWR) strategy. SGW-MSA combines three different local self-attention mechanisms, reconstructs the feature map through the GWR strategy, and extracts image features from the reconstructed feature map, establishing interaction among all windows to capture richer context information. Based on SGW-MSA, this paper designs the overall architecture of the SGWin Transformer. Experimental results show that the proposed algorithm achieves 5.1% higher accuracy than Swin Transformer on the mini-ImageNet image classification dataset, 5.2% higher accuracy than Swin Transformer on CIFAR10 image classification, and 5.5% and 5.1% higher mAP than Swin Transformer on the MS COCO dataset with the Mask R-CNN and Cascade R-CNN object detection frameworks, respectively. Compared with other models based on local self-attention, it is more competitive at a similar parameter scale. © 2023 Science Press. All rights reserved.
Pages: 1976-1988
Number of pages: 12
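
Note: this record contains no code. The following PyTorch sketch illustrates one plausible reading of the SGW-MSA mechanism described in the abstract: window-based multi-head self-attention applied to a feature map whose windows are first recombined using Gaussian-sampled weights (GWR). All names (gaussian_weight_recombination, SGWMSA, the window size ws, sigma) and the exact form of GWR are illustrative assumptions, not the authors' implementation.

# Minimal, illustrative sketch of local window self-attention with a
# Gaussian weight recombination (GWR) step. The exact GWR procedure is an
# assumption for illustration; it is not taken from the paper.
import torch
import torch.nn as nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def gaussian_weight_recombination(x, ws, sigma=1.0):
    # Hypothetical GWR: reorder whole windows by Gaussian-sampled scores so
    # that features from different windows are mixed before local attention.
    B, H, W, C = x.shape
    nh, nw = H // ws, W // ws
    win = x.view(B, nh, ws, nw, ws, C).permute(0, 1, 3, 2, 4, 5)
    win = win.reshape(B, nh * nw, ws, ws, C)
    scores = torch.randn(B, nh * nw, device=x.device) * sigma  # Gaussian weights
    order = scores.argsort(dim=1)
    win = torch.gather(win, 1, order[..., None, None, None].expand_as(win))
    win = win.view(B, nh, nw, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
    return win.reshape(B, H, W, C)

class SGWMSA(nn.Module):
    # Multi-head self-attention inside non-overlapping windows of a
    # GWR-reconstructed feature map (one plausible reading of SGW-MSA).
    def __init__(self, dim, num_heads=4, ws=7):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by ws
        x = gaussian_weight_recombination(x, self.ws)
        B, H, W, C = x.shape
        win = window_partition(x, self.ws)          # (B * nW, ws*ws, C)
        out, _ = self.attn(win, win, win)           # attention within each window
        nh, nw = H // self.ws, W // self.ws
        out = out.view(B, nh, nw, self.ws, self.ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return out

# Usage example (shapes only):
# x = torch.randn(2, 56, 56, 96)                 # (B, H, W, C), 56 divisible by 7
# y = SGWMSA(dim=96, num_heads=4, ws=7)(x)       # y: (2, 56, 56, 96)

The Gaussian-sampled reordering above is only a stand-in for the paper's GWR strategy; the point it illustrates is that window contents are mixed before local attention is applied, so information can flow across all windows rather than staying confined to fixed local regions.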