Attention Module Based on Feature Similarity and Feature Normalization

Cited by: 0
Authors
Du, Qiliang [1 ,2 ,3 ]
Wang, Yimin [1 ]
Tian, Lianfang [1 ,4 ,5 ]
Affiliations
[1] School of Automation Science and Engineering, South China University of Technology, Guangzhou, Guangdong
[2] China-Singapore International Joint Research Institute, South China University of Technology, Guangzhou, Guangdong
[3] Key Laboratory of Autonomous Systems and Network Control, Ministry of Education, South China University of Technology, Guangzhou, Guangdong
[4] Research Institute of Modern Industrial Innovation, South China University of Technology, Zhuhai, Guangdong
[5] Engineering Center of Guangdong Development and Reform Commission, South China University of Technology, Guangzhou, Guangdong
Source
Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science) | 2024, Vol. 52, No. 7
Keywords
attention module; computer vision; convolutional neural network; feature normalization; feature similarity;
DOI
10.12141/j.issn.1000-565X.230313
Abstract
In recent years, attention mechanisms have achieved great success in image classification, object detection, and semantic segmentation. However, most existing attention mechanisms fuse features only along the channel or spatial dimension, which greatly limits how flexibly the attention weights can vary across both dimensions and leaves feature information underused. To address this issue, this paper proposes a convolutional neural network attention module based on feature similarity and feature normalization (FSNAM), which exploits feature information from both the channel domain and the spatial domain. FSNAM consists of a feature similarity module (FSM) and a feature normalization module (FNM). FSM generates a two-dimensional feature similarity weight map from the channel feature information and local spatial feature information of the input feature map, while FNM generates a three-dimensional feature normalization weight map from the global spatial feature information of the input feature map. The weight maps generated by FSM and FNM are fused into a three-dimensional attention weight map, achieving the fusion of channel and spatial feature information. Ablation experiments confirm the feasibility and effectiveness of FSNAM. The results show that, for image classification, FSNAM significantly outperforms other mainstream attention modules in improving classification networks on the CIFAR datasets; for object detection, a detection network using FSNAM improves the detection accuracy of small and medium-sized objects on the VOC dataset by 3.9 and 1.2 percentage points, respectively; and, for semantic segmentation, FSNAM significantly improves the HRNet model, raising its mean pixel accuracy on the SBD dataset by 0.58 percentage points. © 2024 South China University of Technology. All rights reserved.
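The abstract describes FSNAM only at a high level, so the PyTorch sketch below illustrates one plausible reading of that description, not the paper's actual method. All layer choices here are assumptions: pooled channel descriptors mixed by a local convolution stand in for FSM, per-channel spatial normalization stands in for FNM, and elementwise multiplication is assumed as the fusion rule.

```python
# A minimal, illustrative sketch of the FSNAM structure described in the
# abstract. The paper's exact operations are not given in this record,
# so every layer choice and the fusion rule below are assumptions.
import torch
import torch.nn as nn


class FSM(nn.Module):
    """Feature similarity module (assumed form): builds a 2-D (H x W)
    weight map from channel statistics and local spatial context."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Mix pooled channel descriptors over a local spatial window.
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel-wise mean and max summarize channel feature information.
        avg = x.mean(dim=1, keepdim=True)           # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)          # (B, 1, H, W)
        w = self.conv(torch.cat([avg, mx], dim=1))  # local spatial mixing
        return torch.sigmoid(w)                     # 2-D map, (B, 1, H, W)


class FNM(nn.Module):
    """Feature normalization module (assumed form): normalizes each
    channel over the full spatial extent, yielding a 3-D (C x H x W) map."""
    def __init__(self, channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global spatial statistics per channel drive the normalization.
        mu = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        w = self.gamma * (x - mu) / torch.sqrt(var + 1e-5) + self.beta
        return torch.sigmoid(w)                     # 3-D map, (B, C, H, W)


class FSNAM(nn.Module):
    """Fuses the FSM and FNM weight maps into one 3-D attention map."""
    def __init__(self, channels: int):
        super().__init__()
        self.fsm = FSM()
        self.fnm = FNM(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcasting the 2-D FSM map against the 3-D FNM map fuses
        # spatial and channel information (fusion rule assumed here).
        attn = self.fsm(x) * self.fnm(x)
        return x * attn


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(FSNAM(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```

As a drop-in residual-style block, the module preserves the input shape, so under these assumptions it could be inserted after any convolutional stage of a backbone such as ResNet or HRNet.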
Pages: 62-71
Page count: 9