RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

Cited by: 8

Authors:

Jin, Lei [1,2]

Luo, Gen [1]

Zhou, Yiyi [1,2]

Sun, Xiaoshuai [1,2]

Jiang, Guannan [3]

Shu, Annan [3]

Ji, Rongrong [1,2]

Affiliations:

[1] Xiamen Univ, Minist Educ China, Key Lab Multimedia Trusted Percept & Efficient Co, Xiamen 361005, Peoples R China

[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen 361005, Peoples R China

[3] Contemporary Amperex Technol Co Ltd CATL, Intelligent Mfg Dept, Ningde, Peoples R China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China

Keywords:

DOI:

10.1109/CVPR52729.2023.00263

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Referring Expression Comprehension (REC) is the task of grounding the referent based on an expression, and its development is greatly limited by expensive instance-level annotations. Most existing weakly supervised methods are built on two-stage detection networks, which are computationally expensive. In this paper, we resort to the efficient one-stage detector and propose a novel weakly supervised model called RefCLIP. Specifically, RefCLIP redefines weakly supervised REC as an anchor-text matching problem, which avoids the complex post-processing in existing methods. To achieve weakly supervised learning, we introduce an anchor-based contrastive loss to optimize RefCLIP via numerous anchor-text pairs. Based on RefCLIP, we further propose the first model-agnostic weakly supervised training scheme for existing REC models, where RefCLIP acts as a mature teacher to generate pseudo-labels for teaching common REC models. With our careful designs, this scheme can even help existing REC models, e.g., TransVG and SimREC, achieve better weakly supervised performance than RefCLIP. To validate our approaches, we conduct extensive experiments on four REC benchmarks, i.e., RefCOCO, RefCOCO+, RefCOCOg and ReferItGame. Experimental results show not only significant performance gains over existing weakly supervised models, e.g., +24.87% on RefCOCO, but also 5x faster inference speed. Project: https://refclip.github.io.
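
The anchor-text matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `anchor_text_contrastive_loss`, the tensor shapes, the temperature value, and the choice of an InfoNCE-style objective over the best-matching anchor of each image are all assumptions made for illustration. The sketch assumes each image contributes N anchor features and one expression feature, treats the anchor most similar to the paired expression as the positive, and uses the best anchors of the other images in the batch as negatives.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def anchor_text_contrastive_loss(anchors, texts, temperature=0.1):
    """Hypothetical anchor-text contrastive loss (illustrative assumption).

    anchors: (B, N, D) array, N anchor features for each of B images.
    texts:   (B, D) array, one expression feature per image.
    Returns (loss, predicted), where predicted[i] is the index of the
    anchor in image i that best matches expression i.
    """
    a = l2_normalize(anchors)
    t = l2_normalize(texts)
    batch = t.shape[0]

    # sim[i, j, n]: cosine similarity between text i and anchor n of image j.
    sim = np.einsum("id,jnd->ijn", t, a)

    # For every (text, image) pair keep only the best-matching anchor.
    logits = sim.max(axis=2) / temperature  # (B, B)

    # Numerically stable log-softmax over images; the paired image is the target.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_prob))

    # At inference, grounding reduces to picking the best anchor in the paired image.
    idx = np.arange(batch)
    predicted = sim[idx, idx].argmax(axis=1)
    return loss, predicted
```

Under these assumptions, prediction is a single argmax over anchors, which is one way to read the abstract's claim that anchor-text matching avoids complex post-processing. In a teacher-student setup such as the one the abstract describes, the box belonging to the selected anchor could then serve as a pseudo-label.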


Pages: 2681-2690

Page count: 10
