UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

Cited by: 377
Authors
Dai, Zhigang [1 ,2 ,3 ]
Cai, Bolun [2 ]
Lin, Yugeng [2 ]
Chen, Junying [1 ,3 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] Tencent Wechat AI, Shenzhen, Peoples R China
[3] South China Univ Technol, Minist Educ, Key Lab Big Data & Intelligent Robot, Guangzhou, Peoples R China
Source
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
DOI
10.1109/CVPR46437.2021.00165
Chinese Library Classification: TP18 [Theory of artificial intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Object detection with transformers (DETR) reaches competitive performance with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to Unsupervisedly Pre-train DETR (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and feed them as queries to the decoder; the model is pre-trained to detect these query patches in the original image. During pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off the classification and localization preferences of the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch that is jointly optimized with patch detection. (2) To perform multi-query localization, we first introduce UP-DETR with a single query patch and then extend it to multi-query patches with an object query shuffle and an attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR, with faster convergence and higher average precision on object detection, one-shot detection, and panoptic segmentation.
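The data preparation behind the pretext task can be illustrated with a minimal pure-Python sketch. This is an assumption-laden simplification, not the UP-DETR implementation: the function name is hypothetical, and it only shows how random patches and their localization targets could be sampled; in the paper the cropped pixels become decoder queries and the boxes are the prediction targets.

```python
import random

def sample_query_patches(img_h, img_w, num_patches=10, seed=None):
    """Sample random crop boxes from an (img_h x img_w) image.

    Hypothetical helper: returns (x, y, w, h) boxes. In UP-DETR the
    pixels inside each box would be encoded as a query patch fed to
    the decoder, and the box itself is the localization target.
    """
    rng = random.Random(seed)
    boxes = []
    for _ in range(num_patches):
        w = rng.randint(1, img_w)       # patch width, at least 1 px
        h = rng.randint(1, img_h)       # patch height, at least 1 px
        x = rng.randint(0, img_w - w)   # top-left x, so box stays in image
        y = rng.randint(0, img_h - h)   # top-left y
        boxes.append((x, y, w, h))
    return boxes

# Example: 10 query patches for a 480x640 image.
boxes = sample_query_patches(480, 640, num_patches=10, seed=0)
```

Every sampled box lies entirely inside the image, which is what lets the model be supervised for free: the ground-truth location of each query patch is known by construction.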
Pages: 1601-1610 (10 pages)