Open Vocabulary Object Detection with Pseudo Bounding-Box Labels

Cited by: 46
Authors
Gao, Mingfei [1]
Xing, Chen [1]
Niebles, Juan Carlos [1]
Li, Junnan [1]
Xu, Ran [1]
Liu, Wenhao [1]
Xiong, Caiming [1]
Affiliations
[1] Salesforce Res, Palo Alto, CA 94301 USA
Source
COMPUTER VISION, ECCV 2022, PT X | 2022 / Vol. 13670
Keywords
Open vocabulary detection; Pseudo bounding-box labels;
DOI
10.1007/978-3-031-20080-9_16
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite great progress in object detection, most existing methods work only on a limited set of object categories, due to the tremendous human effort needed for bounding-box annotations of training data. To alleviate the problem, recent open vocabulary and zero-shot detection methods attempt to detect novel object categories beyond those seen during training. They achieve this goal by training on a predefined set of base categories to induce generalization to novel objects. However, their potential is still constrained by the small set of base categories available for training. To enlarge the set of base classes, we propose a method to automatically generate pseudo bounding-box annotations of diverse objects from large-scale image-caption pairs. Our method leverages the localization ability of pre-trained vision-language models to generate pseudo bounding-box labels and then directly uses them for training object detectors. Experimental results show that our method outperforms the state-of-the-art open vocabulary detector by 8% AP on COCO novel categories, by 6.3% AP on PASCAL VOC, by 2.3% AP on Objects365 and by 2.8% AP on LVIS.
Pages: 266-282
Number of pages: 17