MOD-YOLO: Multispectral object detection based on transformer dual-stream YOLO

Cited by: 7

Authors:

Shao, Yanhua [1]

Huang, Qimeng [1]

Mei, Yanying [1]

Chu, Hongyu [1,2]

Affiliations:

[1] Southwest Univ Sci & Technol, Sch Informat Engn, Mianyang 621000, Peoples R China

[2] Southwest Univ Sci & Technol, Tianfu Inst Res & Innovat, Chengdu 610299, Peoples R China

Source:

PATTERN RECOGNITION LETTERS | 2024 / Vol. 183

Keywords:

Feature fusion; Multispectral object detection; Transformer; Lightweight model;

DOI:

10.1016/j.patrec.2024.05.001

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Multispectral object detection can effectively improve the precision of object detection in low-visibility scenes, which increases the reliability and stability of object detection applications in open environments. The Cross-Modality Fusion Transformer (CFT) can effectively fuse different spectral information, but this method relies on large models and expensive computing resources. In this paper, we propose multispectral object detection dual-stream YOLO (MOD-YOLO), based on Cross Stage Partial CFT (CSP-CFT), to address the issue that prior studies need heavy inference calculations from the recurrent fusing of multispectral features. This network can divide the fused feature map into two parts, respectively for cross stage output and combined with the next stage feature, to achieve the correct speed/memory/precision balance. To further improve the accuracy, SIoU was selected as the loss function. Ultimately, extensive experiments on multiple publicly available datasets demonstrate that our model, which achieves the smallest model size and excellent performance, produces better tradeoffs between accuracy and model size than other popular models.
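The split described in the abstract (one part of the fused feature map goes to the cross-stage output, the other is combined with the next-stage feature) can be pictured with a minimal sketch. This is not the authors' implementation: the names `split_fused` and `cross_stage_step`, the 50/50 split ratio, the channels-first list layout, and the use of channel-wise concatenation as the "combine" step are all assumptions made for illustration only.

```python
from typing import List, Tuple

Channel = List[float]        # one flattened feature channel (hypothetical layout)
FeatureMap = List[Channel]   # channels-first list of channels


def split_fused(fused: FeatureMap, ratio: float = 0.5) -> Tuple[FeatureMap, FeatureMap]:
    """Split a fused feature map along the channel axis into two parts."""
    if len(fused) < 2:
        raise ValueError("need at least two channels to split")
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    # Clamp the cut so that both parts keep at least one channel.
    cut = max(1, min(len(fused) - 1, round(len(fused) * ratio)))
    return fused[:cut], fused[cut:]


def cross_stage_step(
    fused: FeatureMap,
    next_stage_feature: FeatureMap,
    ratio: float = 0.5,
) -> Tuple[FeatureMap, FeatureMap]:
    """Return (cross-stage output, part combined with the next-stage feature)."""
    cross_stage_out, carried = split_fused(fused, ratio)
    # Assumption: "combined" is modelled as channel-wise concatenation.
    combined = carried + next_stage_feature
    return cross_stage_out, combined


if __name__ == "__main__":
    fused_map = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
    next_feature = [[9.0, 9.0]]
    out, combined = cross_stage_step(fused_map, next_feature)
    print(len(out), len(combined))
```

Under these assumptions, only part of the fused channels is carried into the next stage, which is the general idea behind cross-stage-partial designs for reducing computation; the actual layer structure of CSP-CFT is defined in the paper itself.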

Citation

Pages: 26-34

Page count: 9

References: 28 in total

[1] Cao Y., 2019, P 2019 IEEE 5 INT C, P1965

[2] Multimodal Object Detection via Probabilistic Ensembling [J]. Chen, Yi-Ting; Shi, Jinghao; Ye, Zelin; Mertz, Christoph; Ramanan, Deva; Kong, Shu. COMPUTER VISION, ECCV 2022, PT IX, 2022, 13669: 139-158

[3] F-Team, FREE FLIR THERMAL DA

[4] Gevorgyan Z, 2022, Arxiv, DOI [arXiv:2205.12740, DOI 10.48550/ARXIV.2205.12740]

[5] A Survey on Vision Transformer [J]. Han, Kai; Wang, Yunhe; Chen, Hanting; Chen, Xinghao; Guo, Jianyuan; Liu, Zhenhua; Tang, Yehui; Xiao, An; Xu, Chunjing; Xu, Yixing; Yang, Zhaohui; Zhang, Yiman; Tao, Dacheng. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (01): 87-110

[6] He J., 2021, Neural Information Processing Systems, V34, P20230

[7] Speed/accuracy trade-offs for modern convolutional object detectors [J]. Huang, Jonathan; Rathod, Vivek; Sun, Chen; Zhu, Menglong; Korattikara, Anoop; Fathi, Alireza; Fischer, Ian; Wojna, Zbigniew; Song, Yang; Guadarrama, Sergio; Murphy, Kevin. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 3296-+

[8] Hwang S, 2015, PROC CVPR IEEE, P1037, DOI 10.1109/CVPR.2015.7298706

[9] Transformers in Vision: A Survey [J]. Khan, Salman; Naseer, Muzammal; Hayat, Munawar; Zamir, Syed Waqas; Khan, Fahad Shahbaz; Shah, Mubarak. ACM COMPUTING SURVEYS, 2022, 54 (10S)

[10] Li CY, 2018, Arxiv, DOI arXiv:1808.04818
