Focal DETR: Target-Aware Token Design for Transformer-Based Object Detection

Cited by: 11
Authors
Xie, Tianming [1 ,2 ]
Zhang, Zhonghao [1 ,2 ]
Tian, Jing [3 ]
Ma, Lihong [1 ,2 ]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510640, Peoples R China
[2] Natl Res Ctr Mobile Ultrason Detect, Guangzhou 510640, Peoples R China
[3] Natl Univ Singapore, Inst Syst Sci, Singapore 119615, Singapore
Keywords
object detection; self-attention; query-key similarity; vision transformer
DOI
10.3390/s22228686
CLC Number
O65 [Analytical Chemistry]
Discipline Codes
070302; 081704
Abstract
In this paper, we propose a novel target-aware token design for transformer-based object detection. To tackle the target-attribute-diffusion challenge in transformer-based object detection, the new target-aware token design comprises two key components. First, we propose a target-aware sampling module, which forces the sampling patterns to converge inside the target region and obtain its representative encoded features. Specifically, a set of four sampling patterns is designed: small and large patterns, which focus on the detailed and overall characteristics of a target, respectively, and vertical and horizontal patterns, which handle the object's directional structures. Second, we propose a target-aware key-value matrix: a unified, learnable feature-embedding matrix that is directly weighted on the feature map to reduce the interference of non-target regions. With this design, we build a new variant of the transformer-based object-detection model, called Focal DETR, which achieves superior performance over state-of-the-art transformer-based object-detection models on the COCO object-detection benchmark. Experimental results demonstrate that Focal DETR achieves 44.7 AP on the COCO 2017 test set, which is 2.7 AP and 0.9 AP higher than DETR and Deformable DETR, respectively, under the same training strategy and the same feature-extraction network.
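The second component above can be sketched in simplified form: a learnable per-location weighting applied to the flattened feature map before it serves as the keys and values of attention, so that non-target locations are suppressed. This is only a rough single-head numpy sketch under our own assumptions (the weight's shape, its elementwise application, and the function names are hypothetical, not the paper's exact formulation).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def target_aware_attention(queries, feat, weight):
    """Single-head attention in which a learnable weighting matrix is
    applied to the flattened feature map before it is used as keys and
    values, down-weighting non-target regions (illustrative sketch).

    queries: (n_q, d)  object queries
    feat:    (h*w, d)  flattened encoder feature map
    weight:  (h*w, 1)  learnable per-location weights (assumed shape)
    """
    kv = feat * weight  # suppress non-target locations elementwise
    scores = queries @ kv.T / np.sqrt(feat.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ kv
```

With `weight` fixed to all ones this reduces to plain single-head attention over the feature map; during training the weights would be learned jointly with the rest of the network.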
Pages: 18