YOLO-DAW: Object detection model based on dual attention mechanism within windows

Cited: 0
Authors
Yin Z. [1]
Shao J. [1]
Zhang N. [1]
Affiliations
[1] Key Laboratory of Measurement and Control of Complex Systems of Ministry of Education, Southeast University, Nanjing 210096, China; Intelligent Transportation System Research Center of Ministry of Education, Southeast University
Source
Dongnan Daxue Xuebao (Ziran Kexue Ban)/Journal of Southeast University (Natural Science Edition) | 2023 / Vol. 53 / No. 4
Keywords
attention mechanism; feature fusion; object detection; upsampling based on fully connected layer
DOI
10.3969/j.issn.1001-0505.2023.04.019
Abstract
To introduce the attention mechanism into the you only look once (YOLO) model and thereby improve the algorithm's feature fusion ability and detection accuracy, an improved YOLOv5 model based on a dual attention mechanism within windows (YOLO-DAW) is proposed. In the neck layer, when the model performs feature fusion in the feature pyramid network and the path aggregation network, channel attention and spatial attention mechanisms are introduced respectively, and the attention computation is limited to windows of different sizes to reduce computational complexity. These two attention mechanisms of different natures provide the forward features with global information from a larger receptive field, greatly enhancing the model's ability to understand different features. Experimental results show that the model reaches an mAP50 of 68.6% on the public PASCAL VOC2012 dataset and 51.9% on SODA10M. Compared with YOLOv5m, which has a similar number of parameters, YOLO-DAW leads by 1.2% on both PASCAL VOC2012 and SODA10M. The improved model better integrates local and global features, allowing it to meet the detection requirements of more complex scenes. © 2023 Southeast University. All rights reserved.
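The core idea in the abstract, computing channel and spatial attention only within local windows of the fused feature maps, can be illustrated with a short sketch. The PyTorch code below is a minimal rendering of that idea, not the authors' implementation: the SE-style channel attention, CBAM-style spatial attention, window size, and reduction ratio are all illustrative assumptions, since the abstract does not specify the exact formulation.

```python
# Minimal sketch of window-limited dual attention. Assumes feature-map
# height and width are divisible by the window size; module structure,
# window size, and reduction ratio are illustrative, not from the paper.
import torch
import torch.nn as nn


def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split (B, C, H, W) into non-overlapping ws x ws windows, returning
    (B * num_windows, C, ws, ws) so attention is computed per window
    rather than over the whole feature map."""
    b, c, h, w = x.shape
    x = x.view(b, c, h // ws, ws, w // ws, ws)
    x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
    return x.view(-1, c, ws, ws)


def window_reverse(win: torch.Tensor, ws: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition: reassemble windows into (B, C, H, W)."""
    b = win.shape[0] // ((h // ws) * (w // ws))
    c = win.shape[1]
    x = win.view(b, h // ws, w // ws, c, ws, ws)
    x = x.permute(0, 3, 1, 4, 2, 5).contiguous()
    return x.view(b, c, h, w)


class WindowChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention inside each window
    (a stand-in for the channel attention on the FPN fusion path)."""

    def __init__(self, channels: int, ws: int = 8, ratio: int = 16):
        super().__init__()
        self.ws = ws
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // ratio),
            nn.ReLU(inplace=True),
            nn.Linear(channels // ratio, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        win = window_partition(x, self.ws)        # (B*nW, C, ws, ws)
        scale = self.fc(win.mean(dim=(2, 3)))     # per-window channel weights
        win = win * scale[:, :, None, None]
        return window_reverse(win, self.ws, h, w)


class WindowSpatialAttention(nn.Module):
    """CBAM-style spatial attention restricted to each window
    (a stand-in for the spatial attention on the PAN fusion path)."""

    def __init__(self, ws: int = 8, kernel: int = 7):
        super().__init__()
        self.ws = ws
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        win = window_partition(x, self.ws)
        # Channel-wise mean and max summaries, as in CBAM.
        stats = torch.cat(
            [win.mean(dim=1, keepdim=True), win.amax(dim=1, keepdim=True)],
            dim=1,
        )
        win = win * torch.sigmoid(self.conv(stats))  # per-window spatial mask
        return window_reverse(win, self.ws, h, w)
```

Because each attention map is computed inside its own window, the cost scales with the window contents rather than the full spatial resolution, which matches the abstract's motivation for limiting attention to windows; using a different ws at different pyramid levels would realize the "windows of different sizes".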
Pages: 718-724
Page count: 6
Related papers
26 in total
  • [1] Krizhevsky A, Sutskever I, Hinton G E., ImageNet classification with deep convolutional neural networks, Communications of the ACM, 60, 6, pp. 84-90, (2017)
  • [2] Girshick R, Donahue J, Darrell T, et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, (2014)
  • [3] Girshick R., Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448, (2015)
  • [4] Ren S, He K, Girshick R, et al., Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 6, pp. 1137-1149, (2017)
  • [5] He K, Gkioxari G, Dollar P, et al., Mask R-CNN, IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 2, pp. 386-397, (2020)
  • [6] Redmon J, Divvala S, Girshick R, et al., You only look once: Unified, real-time object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, (2016)
  • [7] Liu W, Anguelov D, Erhan D, et al., SSD: Single shot MultiBox detector, Proceedings of the European Conference on Computer Vision, pp. 21-37, (2016)
  • [8] Lin T Y, Goyal P, Girshick R, et al., Focal loss for dense object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 2, pp. 318-327, (2020)
  • [9] Dosovitskiy A, Beyer L, Kolesnikov A, et al., An image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]