Rethinking the multi-scale feature hierarchy in object detection transformer (DETR)

Cited by: 8
Authors
Liu, Fanglin [1 ]
Zheng, Qinghe [1 ]
Tian, Xinyu [1 ]
Shu, Feng [2 ]
Jiang, Weiwei [3 ]
Wang, Miaohui [4 ]
Elhanashi, Abdussalam [5 ]
Saponara, Sergio [5 ]
Affiliations
[1] Shandong Management Univ, Sch Intelligent Engn, Jinan 250357, Peoples R China
[2] Hainan Univ, Sch Informat & Commun Engn, Haikou 570228, Peoples R China
[3] Beijing Univ Posts & Telecommun, Sch Informat & Commun Engn, Beijing 100876, Peoples R China
[4] Shenzhen Univ, State Key Lab Radio Frequency Heterogeneous Integr, Shenzhen 518060, Peoples R China
[5] Univ Pisa, Sch Informat Engn, I-56122 Pisa, Italy
Keywords
Object detection; Deep neural network; Transformer; Detection Transformer (DETR); Multi-branch structure
DOI
10.1016/j.asoc.2025.113081
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The Detection Transformer (DETR) has emerged as a dominant paradigm in object detection owing to its end-to-end architectural design. Researchers have explored many aspects of DETR, including its structure, pre-training strategies, attention mechanisms, and query embeddings, achieving significant progress. However, high computational costs limit the efficient use of multi-scale feature maps and hinder the full exploitation of complex multi-branch structures. We examine the negative impact of multi-scale features on the computational cost of DETR-style detectors and find that feeding long token sequences to the encoder is suboptimal. In this work, we push the boundaries of DETR's performance and efficiency from the model-structure perspective by developing the Fusion Detection Transformer (F-DETR), which has a heterogeneous-scale multi-branch structure. To the best of our knowledge, this is the first explicit attempt to integrate multi-scale features into the end-to-end DETR architecture. Specifically, we propose a multi-branch structure that simultaneously exploits feature maps at different levels, facilitating the interaction between local and global features. In addition, we select joint latent variables from the interactive information flow to initialize the object containers (queries), a technique commonly used in query-based detectors. Experimental results show that F-DETR achieves 43.9% AP after 36 training epochs on the public COCO dataset. Furthermore, it offers a better accuracy-complexity trade-off than the original DETR.
Pages: 16
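
The abstract describes two mechanisms: a multi-branch structure that fuses feature maps from different backbone levels, and the selection of joint latent variables from the fused information flow to initialize object queries. This record contains no code, so the following is a minimal PyTorch sketch of that general idea under stated assumptions; the class and parameter names (MultiScaleBranchFusion, d_model, num_queries, the top-k salience scoring) are hypothetical illustrations, not F-DETR's actual implementation.

    import torch
    import torch.nn as nn

    class MultiScaleBranchFusion(nn.Module):
        # Hypothetical sketch, not the authors' code: per-scale branches project
        # backbone maps to a shared width, one attention step mixes local
        # (fine-scale) and global (coarse-scale) tokens, and the top-scoring
        # fused tokens initialize the object queries of a DETR-style decoder.
        def __init__(self, in_channels=(512, 1024, 2048), d_model=256, num_queries=100):
            super().__init__()
            # One 1x1-conv branch per backbone scale (e.g. ResNet C3/C4/C5).
            self.branches = nn.ModuleList(
                nn.Conv2d(c, d_model, kernel_size=1) for c in in_channels
            )
            self.fuse = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
            self.score = nn.Linear(d_model, 1)  # per-token salience for query selection
            self.num_queries = num_queries

        def forward(self, feats):
            # feats: list of (B, C_i, H_i, W_i) feature maps, one per scale.
            tokens = [b(f).flatten(2).transpose(1, 2) for b, f in zip(self.branches, feats)]
            seq = torch.cat(tokens, dim=1)  # joint multi-scale token sequence (B, N, d)
            # Cross-scale interaction: every token attends to the whole sequence,
            # so fine-scale detail and coarse-scale context exchange information.
            fused, _ = self.fuse(seq, seq, seq)
            # Pick the num_queries highest-scoring fused tokens as initial queries.
            idx = self.score(fused).squeeze(-1).topk(self.num_queries, dim=1).indices
            return fused.gather(1, idx.unsqueeze(-1).expand(-1, -1, fused.size(-1)))

    if __name__ == "__main__":
        model = MultiScaleBranchFusion()
        feats = [torch.randn(2, 512, 32, 32),   # fine scale: 1024 tokens
                 torch.randn(2, 1024, 16, 16),  # 256 tokens
                 torch.randn(2, 2048, 8, 8)]    # coarse scale: 64 tokens
        print(model(feats).shape)  # torch.Size([2, 100, 256])

Note that full self-attention over the concatenated multi-scale sequence is precisely the long-sequence cost the abstract identifies as suboptimal; the sketch keeps it only for brevity, whereas the paper's heterogeneous-scale branch design is presumably what avoids it.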