UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

Cited by: 377
Authors
Dai, Zhigang [1 ,2 ,3 ]
Cai, Bolun [2 ]
Lin, Yugeng [2 ]
Chen, Junying [1 ,3 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] Tencent Wechat AI, Shenzhen, Peoples R China
[3] South China Univ Technol, Minist Educ, Key Lab Big Data & Intelligent Robot, Guangzhou, Peoples R China
Source
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
DOI
10.1109/CVPR46437.2021.00165
Chinese Library Classification: TP18 [Theory of artificial intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Object detection with transformers (DETR) reaches competitive performance with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to Unsupervisedly Pre-train DETR (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and feed them as queries to the decoder; the model is pre-trained to detect these query patches in the original image. During pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off the classification and localization preferences of the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch that is jointly optimized with patch detection. (2) To perform multi-query localization, we first introduce UP-DETR with a single query patch and then extend it to multi-query patches with an object query shuffle and an attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR, with faster convergence and higher average precision on object detection, one-shot detection, and panoptic segmentation.
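The data preparation behind the pretext task can be illustrated with a minimal pure-Python sketch. This is an assumption-laden simplification, not the UP-DETR implementation: the function name is hypothetical, and it only shows how random patches and their localization targets could be sampled; in the paper the cropped pixels become decoder queries and the boxes are the prediction targets.

```python
import random

def sample_query_patches(img_h, img_w, num_patches=10, seed=None):
    """Sample random crop boxes from an (img_h x img_w) image.

    Hypothetical helper: returns (x, y, w, h) boxes. In UP-DETR the
    pixels inside each box would be encoded as a query patch fed to
    the decoder, and the box itself is the localization target.
    """
    rng = random.Random(seed)
    boxes = []
    for _ in range(num_patches):
        w = rng.randint(1, img_w)       # patch width, at least 1 px
        h = rng.randint(1, img_h)       # patch height, at least 1 px
        x = rng.randint(0, img_w - w)   # top-left x, so box stays in image
        y = rng.randint(0, img_h - h)   # top-left y
        boxes.append((x, y, w, h))
    return boxes

# Example: 10 query patches for a 480x640 image.
boxes = sample_query_patches(480, 640, num_patches=10, seed=0)
```

Every sampled box lies entirely inside the image, which is what lets the model be supervised for free: the ground-truth location of each query patch is known by construction.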
Pages: 1601-1610 (10 pages)