AugDETR: Improving Multi-scale Learning for Detection Transformer

Cited by: 2
Authors
Dong, Jinpeng
Lin, Yutong
Li, Chen
Zhou, Sanping
Zheng, Nanning [1 ]
Affiliation
[1] Xi'an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Xi'an, People's Republic of China
Source
COMPUTER VISION - ECCV 2024, PT XXIV | 2025 / Vol. 15082
Funding
National Natural Science Foundation of China
Keywords
Object detection; Detection transformer; Hybrid attention; Multi-level encoder;
DOI
10.1007/978-3-031-72691-0_14
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Current end-to-end detectors typically exploit transformers to detect objects and show promising performance. Among them, Deformable DETR is a representative paradigm that effectively exploits multi-scale features. However, its small local receptive fields and limited query-encoder interactions weaken multi-scale learning. In this paper, we analyze local feature enhancement and multi-level encoder exploitation as routes to improved multi-scale learning, and construct a novel detector named Augmented DETR (AugDETR) to realize them. Specifically, AugDETR consists of two components: a Hybrid Attention Encoder and Encoder-Mixing Cross-Attention. The Hybrid Attention Encoder enlarges the receptive field of the deformable encoder and introduces global context features to enhance feature representation. Encoder-Mixing Cross-Attention adaptively leverages multi-level encoders based on query features, yielding more discriminative object features and faster convergence. When AugDETR is combined with DETR-based detectors such as DINO, AlignDETR, and DDQ, our models achieve improvements of 1.2, 1.1, and 1.0 AP, respectively, on COCO under the ResNet-50 4-scale, 12-epoch setting.
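The core of Encoder-Mixing Cross-Attention, as described in the abstract, is letting each object query adaptively weight features drawn from multiple encoder levels. A minimal sketch of that weighting step is below; the function name `encoder_mixing`, the projection `w_mix`, and the use of plain NumPy arrays are illustrative assumptions, not the paper's actual implementation (which operates inside a transformer decoder's cross-attention).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_mixing(query, encoder_outputs, w_mix):
    """Per-query adaptive mixing of multi-level encoder features (sketch).

    query:           (num_queries, d) object query features
    encoder_outputs: list of L arrays, each (num_queries, d) -- features
                     already gathered from encoder level l for each query
    w_mix:           (d, L) hypothetical projection producing per-level logits
    """
    levels = np.stack(encoder_outputs, axis=1)        # (q, L, d)
    logits = query @ w_mix                            # (q, L): one logit per level
    weights = softmax(logits, axis=-1)                # (q, L): mixing weights
    return (weights[..., None] * levels).sum(axis=1)  # (q, d): mixed features
```

With a zero projection the logits are uniform and the result reduces to the plain average of the levels; a learned `w_mix` lets each query emphasize the encoder level most informative for its object scale.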
Pages: 238-255
Page count: 18