RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

Cited by: 8

Authors:

Jin, Lei [1,2]

Luo, Gen [1]

Zhou, Yiyi [1,2]

Sun, Xiaoshuai [1,2]

Jiang, Guannan [3]

Shu, Annan [3]

Ji, Rongrong [1,2]

Affiliations:

[1] Xiamen Univ, Minist Educ China, Key Lab Multimedia Trusted Percept & Efficient Co, Xiamen 361005, Peoples R China

[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen 361005, Peoples R China

[3] Contemporary Amperex Technol Co Ltd CATE, Intelligent Mfg Dept, Ningde, Peoples R China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China;

Keywords:

DOI:

10.1109/CVPR52729.2023.00263

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Referring Expression Comprehension (REC) is a task of grounding the referent based on an expression, and its development is greatly limited by expensive instance-level annotations. Most existing weakly supervised methods are built based on two-stage detection networks, which are computationally expensive. In this paper, we resort to the efficient one-stage detector and propose a novel weakly supervised model called RefCLIP. Specifically, RefCLIP redefines weakly supervised REC as an anchor-text matching problem, which can avoid the complex post-processing in existing methods. To achieve weakly supervised learning, we introduce anchor-based contrastive loss to optimize RefCLIP via numerous anchor-text pairs. Based on RefCLIP, we further propose the first model-agnostic weakly supervised training scheme for existing REC models, where RefCLIP acts as a mature teacher to generate pseudo-labels for teaching common REC models. With our careful designs, this scheme can even help existing REC models achieve better weakly supervised performance than RefCLIP, e.g., TransVG and SimREC. To validate our approaches, we conduct extensive experiments on four REC benchmarks, i.e., RefCOCO, RefCOCO+, RefCOCOg and ReferItGame. Experimental results not only report our significant performance gains over existing weakly supervised models, e.g., +24.87% on RefCOCO, but also show the 5x faster inference speed. Project: https://refclip.github.io.
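The anchor-text matching idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration rather than the authors' implementation: the function names, the cosine similarity, the InfoNCE-style form of the loss, the temperature value, and the choice of the best-matching anchor in the paired image as the positive are all hypothetical, since the abstract does not give these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_text_contrastive_loss(anchors_per_image, text_feats, temperature=0.07):
    """InfoNCE-style loss over anchor-text pairs (hypothetical form).

    anchors_per_image[i] holds the anchor feature vectors of image i, and
    text_feats[i] is the expression embedding paired with image i.
    Assumption: the positive is the best-matching anchor in the paired image,
    and the negatives are all anchors taken from the other images in the batch.
    """
    losses = []
    for i, text in enumerate(text_feats):
        sims = [
            [cosine(anchor, text) / temperature for anchor in anchors]
            for anchors in anchors_per_image
        ]
        positive = max(sims[i])
        negatives = [s for j, row in enumerate(sims) if j != i for s in row]
        logits = [positive] + negatives
        # Numerically stable log-sum-exp over the positive and the negatives.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - positive)
    return sum(losses) / len(losses)


def predict_anchor(anchors, text):
    """At inference, return the index of the anchor that best matches the text."""
    return max(range(len(anchors)), key=lambda k: cosine(anchors[k], text))


# Toy batch: two images with two 2-D anchor features each.
anchors = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
matched_loss = anchor_text_contrastive_loss(anchors, [[1.0, 0.0], [-1.0, 0.0]])
mismatched_loss = anchor_text_contrastive_loss(anchors, [[-1.0, 0.0], [1.0, 0.0]])
```

In this toy batch the loss is lower when each expression is paired with its own image than when the pairings are swapped, and prediction reduces to picking a single anchor, which is one way to read the abstract's claim that anchor-text matching avoids complex post-processing.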

Citation

Pages: 2681-2690

Page count: 10

50 items in total

[11] Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding
Liu, Xuejing; Li, Liang; Wang, Shuhui; Zha, Zheng-Jun; Su, Li; Huang, Qingming
PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019: 539-547
[12] Weakly Supervised Referring Expression Grounding via Target-Guided Knowledge Distillation
Mi, Jinpeng; Tang, Song; Ma, Zhiyuan; Liu, Dan; Li, Qingdu; Zhang, Jianwei
2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), 2023: 8299-8305
[13] Entity-Enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding
Liu, Xuejing; Li, Liang; Wang, Shuhui; Zha, Zheng-Jun; Li, Zechao; Tian, Qi; Huang, Qingming
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (03): 3003-3018
[14] Dynamic Graph Attention for Referring Expression Comprehension
Yang, Sibei; Li, Guanbin; Yu, Yizhou
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 4643-4652
[15] Exploring Logical Reasoning for Referring Expression Comprehension
Cheng, Ying; Wang, Ruize; Yu, Jiashuo; Zhao, Rui-Wei; Zhang, Yuejie; Feng, Rui
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 5047-5055
[16] InterREC: An Interpretable Method for Referring Expression Comprehension
Wang, Wenbin; Pagnucco, Maurice; Xu, Chengpei; Song, Yang
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 9330-9342
[17] Image Segmentation With Language Referring Expression and Comprehension
Sun, Jiaxing; Li, Yujie; Cai, Jintong; Lu, Huimin; Serikawa, Seiichi
IEEE SENSORS JOURNAL, 2022, 22 (18): 17406-17413
[18] ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Su, Wei; Miao, Peihan; Dou, Huanzhang; Li, Xi
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13449-13458
[19] Referring Expression Generation and Comprehension via Attributes
Liu, Jingyu; Wang, Liang; Yang, Ming-Hsuan
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017: 4866-4874
[20] Revisiting Counterfactual Problems in Referring Expression Comprehension
Yu, Zhihan; Li, Ruifan
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13438-13448
