RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

Cited by: 8

Authors:

Jin, Lei [1,2]

Luo, Gen [1]

Zhou, Yiyi [1,2]

Sun, Xiaoshuai [1,2]

Jiang, Guannan [3]

Shu, Annan [3]

Ji, Rongrong [1,2]

Affiliations:

[1] Xiamen Univ, Minist Educ China, Key Lab Multimedia Trusted Percept & Efficient Co, Xiamen 361005, Peoples R China

[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen 361005, Peoples R China

[3] Contemporary Amperex Technol Co Ltd CATL, Intelligent Mfg Dept, Ningde, Peoples R China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

DOI:

10.1109/CVPR52729.2023.00263

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Referring Expression Comprehension (REC) is the task of grounding a referent based on an expression, and its development is greatly limited by expensive instance-level annotations. Most existing weakly supervised methods are built on two-stage detection networks, which are computationally expensive. In this paper, we resort to an efficient one-stage detector and propose a novel weakly supervised model called RefCLIP. Specifically, RefCLIP redefines weakly supervised REC as an anchor-text matching problem, which avoids the complex post-processing of existing methods. To achieve weakly supervised learning, we introduce an anchor-based contrastive loss that optimizes RefCLIP via numerous anchor-text pairs. Based on RefCLIP, we further propose the first model-agnostic weakly supervised training scheme for existing REC models, in which RefCLIP acts as a mature teacher that generates pseudo-labels for training common REC models. With our careful designs, this scheme can even help existing REC models, e.g., TransVG and SimREC, achieve better weakly supervised performance than RefCLIP. To validate our approaches, we conduct extensive experiments on four REC benchmarks, i.e., RefCOCO, RefCOCO+, RefCOCOg and ReferItGame. Experimental results show not only significant performance gains over existing weakly supervised models, e.g., +24.87% on RefCOCO, but also 5x faster inference. Project: https://refclip.github.io.
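
The abstract names two ideas without giving formulas: anchor-text matching and an anchor-based contrastive loss. The sketch below is one plausible reading, not the authors' implementation. It assumes that the positive for each expression is the most similar anchor in its own image, that anchors from other images in the batch serve as negatives, and that the loss takes an InfoNCE form with a temperature. The function names (`cosine`, `best_anchor`, `anchor_contrastive_loss`), the temperature value, and the toy 2-D features are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_anchor(anchors, text):
    """Anchor-text matching: index of the anchor most similar to the text."""
    sims = [cosine(a, text) for a in anchors]
    return max(range(len(sims)), key=sims.__getitem__)


def anchor_contrastive_loss(batch_anchors, batch_texts, temperature=0.1):
    """InfoNCE-style loss averaged over the batch (assumed form, see lead-in)."""
    total = 0.0
    for i, text in enumerate(batch_texts):
        # Assumed positive: the best-matching anchor inside the paired image.
        pos_idx = best_anchor(batch_anchors[i], text)
        pos = cosine(batch_anchors[i][pos_idx], text) / temperature
        # Assumed negatives: every anchor from the other images in the batch.
        negs = [
            cosine(a, text) / temperature
            for j, anchors in enumerate(batch_anchors)
            if j != i
            for a in anchors
        ]
        # Numerically stable log-sum-exp over positive and negative logits.
        logits = [pos] + negs
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - pos
    return total / len(batch_texts)


# Toy batch: two images with two anchor features each, one expression per image.
anchors = [
    [[1.0, 0.0], [0.0, 1.0]],    # image 0
    [[-1.0, 0.0], [0.0, -1.0]],  # image 1
]
texts = [[0.9, 0.1], [-0.1, -0.9]]

aligned = anchor_contrastive_loss(anchors, texts)
swapped = anchor_contrastive_loss(anchors, texts[::-1])
```

In this toy batch the loss is lower when each expression is paired with its own image (`aligned`) than when the pairing is swapped (`swapped`), which is the behaviour a contrastive objective of this kind is meant to reward. At inference time, `best_anchor` alone would select the anchor whose box is returned, so no separate ranking or post-processing step is involved in this sketch.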

Pages: 2681-2690

Page count: 10
