Focal DETR: Target-Aware Token Design for Transformer-Based Object Detection

Cited by: 11
Authors
Xie, Tianming [1 ,2 ]
Zhang, Zhonghao [1 ,2 ]
Tian, Jing [3 ]
Ma, Lihong [1 ,2 ]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510640, Peoples R China
[2] Natl Res Ctr Mobile Ultrason Detect, Guangzhou 510640, Peoples R China
[3] Natl Univ Singapore, Inst Syst Sci, Singapore 119615, Singapore
Keywords
object detection; self-attention; query-key similarity; vision transformer
DOI
10.3390/s22228686
CLC Number
O65 [Analytical Chemistry]
Discipline Codes
070302; 081704
Abstract
In this paper, we propose a novel target-aware token design for transformer-based object detection. To tackle the target-attribute-diffusion challenge in transformer-based object detection, the new target-aware token design comprises two key components. First, we propose a target-aware sampling module, which forces the sampling patterns to converge inside the target region and obtain its representative encoded features. Specifically, a set of four sampling patterns is designed: small and large patterns, which focus on the detailed and overall characteristics of a target, respectively, and vertical and horizontal patterns, which handle the object's directional structures. Second, we propose a target-aware key-value matrix: a unified, learnable feature-embedding matrix that is directly weighted on the feature map to reduce the interference of non-target regions. With this design, we build a new variant of the transformer-based object-detection model, called Focal DETR, which achieves superior performance over state-of-the-art transformer-based object-detection models on the COCO object-detection benchmark. Experimental results demonstrate that Focal DETR achieves 44.7 AP on the COCO 2017 test set, which is 2.7 AP and 0.9 AP higher than DETR and Deformable DETR, respectively, under the same training strategy and the same feature-extraction network.
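The second component above can be sketched in simplified form: a learnable per-location weighting applied to the flattened feature map before it serves as the keys and values of attention, so that non-target locations are suppressed. This is only a rough single-head numpy sketch under our own assumptions (the weight's shape, its elementwise application, and the function names are hypothetical, not the paper's exact formulation).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def target_aware_attention(queries, feat, weight):
    """Single-head attention in which a learnable weighting matrix is
    applied to the flattened feature map before it is used as keys and
    values, down-weighting non-target regions (illustrative sketch).

    queries: (n_q, d)  object queries
    feat:    (h*w, d)  flattened encoder feature map
    weight:  (h*w, 1)  learnable per-location weights (assumed shape)
    """
    kv = feat * weight  # suppress non-target locations elementwise
    scores = queries @ kv.T / np.sqrt(feat.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ kv
```

With `weight` fixed to all ones this reduces to plain single-head attention over the feature map; during training the weights would be learned jointly with the rest of the network.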
Pages: 18