Multi-scale coupled attention for visual object detection

Cited: 7
Authors
Li, Fei [1]
Yan, Hongping [2]
Shi, Linsu [1]
Affiliations
[1] China Tower Corp Ltd, 9 Dongran North St, Beijing 100195, Peoples R China
[2] China Univ Geosci, Xueyuan Rd 29, Beijing 100083, Peoples R China
Keywords
Attention mechanism; Deep neural networks; Object detection; Self-attention learning; Transformer; YOLO
DOI
10.1038/s41598-024-60897-8
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
The application of deep neural networks has achieved remarkable success in object detection. However, network structures must still be continually evolved and finely tuned to achieve better performance. This responds to the ongoing demand for high performance in complex scenes, where the objects to be detected appear at multiple scales and arbitrary locations. To this end, this paper proposes a network structure called Multi-Scale Coupled Attention (MSCA) under the framework of self-attention learning with importance-assessment methodologies. Architecturally, it consists of a Multi-Scale Coupled Channel Attention (MSCCA) module and a Multi-Scale Coupled Spatial Attention (MSCSA) module. Specifically, the MSCCA module performs self-attention learning linearly over the multi-scale channels, while in parallel the MSCSA module performs it nonlinearly over the multi-scale spatial grids. The MSCCA and MSCSA modules can be connected in sequence and used as a plugin to build end-to-end learning models for object detection. Finally, the proposed network is compared on two public datasets against 13 classical or state-of-the-art models: Faster R-CNN, Cascade R-CNN, RetinaNet, SSD, PP-YOLO, YOLO v3, YOLO v5, YOLO v7, YOLOX, DETR, Conditional DETR, UP-DETR, and FP-DETR. Quantitative comparisons, an ablation study, and an analysis of performance behaviour all demonstrate the effectiveness of the proposed model.
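The abstract describes the architecture only at a high level: a channel-attention module (linear over multi-scale channels) followed by a spatial-attention module (nonlinear over multi-scale grids), chained as a plugin. Below is a minimal PyTorch sketch of how such a pair might be wired together; the pooling scales, kernel sizes, and the exact attention computations are illustrative assumptions, not the authors' implementation.

```python
# A minimal, hypothetical sketch of the MSCCA -> MSCSA plugin described in
# the abstract. All internals (pooling scales, kernel sizes, the attention
# maths) are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class MSCCASketch(nn.Module):
    """Channel attention computed linearly from multi-scale pooled descriptors."""

    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        # One average-pooling branch per (assumed) grid scale.
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in scales)
        # A single linear layer scores each channel from its concatenated
        # multi-scale descriptor -- the "linear" part mentioned in the abstract.
        self.fc = nn.Linear(sum(s * s for s in scales), 1)

    def forward(self, x):
        b, c, _, _ = x.shape
        desc = torch.cat([p(x).flatten(2) for p in self.pools], dim=2)  # (b, c, sum s^2)
        weights = torch.sigmoid(self.fc(desc))                          # (b, c, 1)
        return x * weights.view(b, c, 1, 1)


class MSCSASketch(nn.Module):
    """Spatial attention fused nonlinearly over multi-scale receptive fields."""

    def __init__(self, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One conv branch per (assumed) kernel size over a 2-channel summary map.
        self.convs = nn.ModuleList(
            nn.Conv2d(2, 1, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # Channel-wise max and mean give a compact spatial summary of x.
        summary = torch.cat(
            [x.max(dim=1, keepdim=True).values, x.mean(dim=1, keepdim=True)],
            dim=1,
        )
        # Sigmoid over the summed branches is the nonlinear fusion step.
        attn = torch.sigmoid(sum(conv(summary) for conv in self.convs))
        return x * attn


class MSCAPlugin(nn.Module):
    """MSCCA followed by MSCSA, usable as a drop-in block in a detector."""

    def __init__(self):
        super().__init__()
        self.channel_attn = MSCCASketch()
        self.spatial_attn = MSCSASketch()

    def forward(self, x):
        return self.spatial_attn(self.channel_attn(x))


if __name__ == "__main__":
    feat = torch.randn(2, 256, 32, 32)     # a typical FPN-level feature map
    print(MSCAPlugin()(feat).shape)        # torch.Size([2, 256, 32, 32])
```

Because both modules preserve the input shape, a block like this can be inserted after any backbone or neck stage of a detector without changing the surrounding layer dimensions, which is what makes the sequential MSCCA-then-MSCSA arrangement usable as a plugin.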
Pages: 19
References
61 entries in total
[1] Ali A., 2021, Adv. Neural Inf. Process. Syst., V34, P20014, DOI 10.48550/arXiv.2106.09681
[2] Bochkovskiy A., 2020, arXiv, DOI 10.48550/arXiv.2004.10934
[3] Cai Z., Vasconcelos N., 2018, Cascade R-CNN: Delving into High Quality Object Detection, IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), P6154-6162
[4] Carion N., 2020, End-to-End Object Detection with Transformers, Eur. Conf. Comput. Vis. (ECCV)
[5] Chen Q., 2022, Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment, arXiv
[6] Chen Q., Wang Y., Yang T., Zhang X., Cheng J., Sun J., 2021, You Only Look One-level Feature, IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), P13034-13043
[7] Clevert D.-A., 2016, Int. Conf. Learn. Represent. (ICLR)
[8] Cristianini N., Shawe-Taylor J., 2000, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, DOI 10.1017/CBO9780511801389
[9] Dai J.F., 2016, Adv. Neural Inf. Process. Syst., V29
[10] Dai Z., Cai B., Lin Y., Chen J., 2021, UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), P1601-1610