GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Cited by: 0
Authors
Shen, Haozhan [1 ]
Zhao, Tiancheng [2 ]
Zhu, Mingwei [1 ]
Yin, Jianwei [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Zhejiang Univ, Binjiang Inst, Hangzhou, Peoples R China
Source
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5 | 2024
Funding
National Key Research and Development Program of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual grounding, a crucial vision-language task that requires understanding the visual context referred to by a query expression, demands that a model capture interactions between objects as well as spatial and attribute information. However, annotation data for visual grounding is limited because the annotation process is time-consuming and labor-intensive, so trained models struggle to generalize to broader domains. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from existing models trained on image-text pairs and pure object detection data, both of which are easier to obtain and cover a broader domain than visual grounding annotations. GroundVLP introduces a fusion mechanism that combines the GradCAM heatmap with the object proposals of an open-vocabulary detector. We demonstrate that the proposed method significantly outperforms other zero-shot methods on the RefCOCO/+/g datasets, surpassing the prior zero-shot state of the art by approximately 28% on the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to, or even better than, some non-VLP-based supervised models on the Flickr30K Entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP.
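To make the fusion mechanism concrete: GradCAM applied to the vision-language model's image-text interaction yields a heatmap indicating where the query expression activates, and the open-vocabulary detector supplies candidate boxes; grounding then reduces to picking the proposal that best covers the activated region. The Python sketch below illustrates only this box-scoring step. The function name select_box_by_heatmap and the scoring heuristic (mean activation inside the box, weighted by the box's share of total activation) are illustrative assumptions, not the exact formulation in the released GroundVLP code, which may combine the two signals differently.

import numpy as np

def select_box_by_heatmap(heatmap: np.ndarray, boxes: np.ndarray) -> int:
    """Return the index of the proposal that best covers the GradCAM activation.

    heatmap: (H, W) array of non-negative activations, resized to image resolution.
    boxes:   (N, 4) array of detector proposals in (x1, y1, x2, y2) pixel coordinates.
    """
    total = heatmap.sum() + 1e-8
    scores = []
    for x1, y1, x2, y2 in boxes.astype(int):
        region = heatmap[y1:y2, x1:x2]
        if region.size == 0:  # skip degenerate boxes
            scores.append(float("-inf"))
            continue
        inside = region.sum()
        # Mean activation inside the box, weighted by the box's share of the
        # total activation (illustrative heuristic, not the paper's exact rule).
        scores.append((inside / region.size) * (inside / total))
    return int(np.argmax(scores))

if __name__ == "__main__":
    # Toy usage: a random "heatmap" and two candidate proposals.
    rng = np.random.default_rng(0)
    heatmap = rng.random((480, 640))
    boxes = np.array([[10, 10, 100, 100], [200, 150, 400, 420]], dtype=float)
    print("selected proposal:", select_box_by_heatmap(heatmap, boxes))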
Pages: 4766-4775
Number of pages: 10