AugDETR: Improving Multi-scale Learning for Detection Transformer

Cited by: 2
Authors
Dong, Jinpeng
Lin, Yutong
Li, Chen
Zhou, Sanping
Zheng, Nanning [1 ]
Affiliation
[1] Xi'an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Xi'an, People's Republic of China
Source
COMPUTER VISION - ECCV 2024, PT XXIV | 2025 / Vol. 15082
Funding
National Natural Science Foundation of China
Keywords
Object detection; Detection transformer; Hybrid attention; Multi-level encoder;
DOI
10.1007/978-3-031-72691-0_14
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Current end-to-end detectors typically exploit transformers to detect objects and show promising performance. Among them, Deformable DETR is a representative paradigm that effectively exploits multi-scale features. However, its small local receptive fields and limited query-encoder interactions weaken multi-scale learning. In this paper, we analyze local feature enhancement and multi-level encoder exploitation as routes to improved multi-scale learning, and construct a novel detector named Augmented DETR (AugDETR) to realize them. Specifically, AugDETR consists of two components: a Hybrid Attention Encoder and Encoder-Mixing Cross-Attention. The Hybrid Attention Encoder enlarges the receptive field of the deformable encoder and introduces global context features to enhance feature representation. Encoder-Mixing Cross-Attention adaptively leverages multi-level encoders based on query features, yielding more discriminative object features and faster convergence. When AugDETR is combined with DETR-based detectors such as DINO, AlignDETR, and DDQ, our models achieve improvements of 1.2, 1.1, and 1.0 AP, respectively, on COCO under the ResNet-50 4-scale, 12-epoch setting.
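The core of Encoder-Mixing Cross-Attention, as described in the abstract, is letting each object query adaptively weight features drawn from multiple encoder levels. A minimal sketch of that weighting step is below; the function name `encoder_mixing`, the projection `w_mix`, and the use of plain NumPy arrays are illustrative assumptions, not the paper's actual implementation (which operates inside a transformer decoder's cross-attention).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_mixing(query, encoder_outputs, w_mix):
    """Per-query adaptive mixing of multi-level encoder features (sketch).

    query:           (num_queries, d) object query features
    encoder_outputs: list of L arrays, each (num_queries, d) -- features
                     already gathered from encoder level l for each query
    w_mix:           (d, L) hypothetical projection producing per-level logits
    """
    levels = np.stack(encoder_outputs, axis=1)        # (q, L, d)
    logits = query @ w_mix                            # (q, L): one logit per level
    weights = softmax(logits, axis=-1)                # (q, L): mixing weights
    return (weights[..., None] * levels).sum(axis=1)  # (q, d): mixed features
```

With a zero projection the logits are uniform and the result reduces to the plain average of the levels; a learned `w_mix` lets each query emphasize the encoder level most informative for its object scale.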
Pages: 238-255
Page count: 18