RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

Cited by: 8
Authors
Jin, Lei [1, 2]
Luo, Gen [1]
Zhou, Yiyi [1, 2]
Sun, Xiaoshuai [1, 2]
Jiang, Guannan [3]
Shu, Annan [3]
Ji, Rongrong [1, 2]
Affiliations
[1] Xiamen Univ, Minist Educ China, Key Lab Multimedia Trusted Percept & Efficient Co, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen 361005, Peoples R China
[3] Contemporary Amperex Technol Co Ltd CATE, Intelligent Mfg Dept, Ningde, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
DOI
10.1109/CVPR52729.2023.00263
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring Expression Comprehension (REC) is a task of grounding the referent based on an expression, and its development is greatly limited by expensive instance-level annotations. Most existing weakly supervised methods are built based on two-stage detection networks, which are computationally expensive. In this paper, we resort to the efficient one-stage detector and propose a novel weakly supervised model called RefCLIP. Specifically, RefCLIP redefines weakly supervised REC as an anchor-text matching problem, which can avoid the complex post-processing in existing methods. To achieve weakly supervised learning, we introduce anchor-based contrastive loss to optimize RefCLIP via numerous anchor-text pairs. Based on RefCLIP, we further propose the first model-agnostic weakly supervised training scheme for existing REC models, where RefCLIP acts as a mature teacher to generate pseudo-labels for teaching common REC models. With our careful designs, this scheme can even help existing REC models achieve better weakly supervised performance than RefCLIP, e.g., TransVG and SimREC. To validate our approaches, we conduct extensive experiments on four REC benchmarks, i.e., RefCOCO, RefCOCO+, RefCOCOg and ReferItGame. Experimental results not only report our significant performance gains over existing weakly supervised models, e.g., +24.87% on RefCOCO, but also show the 5x faster inference speed. Project: https://refclip.github.io.
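The anchor-text matching idea in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it is a generic InfoNCE-style contrastive loss, written under the assumption that each image yields a list of anchor feature vectors and each expression yields one text vector of the same dimension. The function names `best_anchor` and `anchor_text_contrastive_loss`, the cosine similarity, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Guard against zero vectors so the division is always defined.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def best_anchor(anchors, text):
    """Return (index, cosine similarity) of the anchor closest to the text."""
    t = normalize(text)
    sims = [dot(normalize(a), t) for a in anchors]
    idx = max(range(len(sims)), key=sims.__getitem__)
    return idx, sims[idx]


def anchor_text_contrastive_loss(batch_anchors, batch_texts, temperature=0.1):
    """InfoNCE-style loss over a batch (an assumed formulation).

    For text i, the positive score is its best-anchor similarity with
    image i; the best-anchor similarities with the other images in the
    batch act as negatives. No box annotations are used.
    """
    losses = []
    for i, text in enumerate(batch_texts):
        logits = [best_anchor(anchors, text)[1] / temperature
                  for anchors in batch_anchors]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

At inference time, a model trained this way would only need `best_anchor` to pick the anchor whose box is returned, which is one way to read the abstract's claim that anchor-text matching avoids complex post-processing.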
Pages: 2681-2690
Page count: 10
Related papers
50 in total
  • [1] Adaptive knowledge distillation and integration for weakly supervised referring expression comprehension
    Mi, Jinpeng
    Wermter, Stefan
    Zhang, Jianwei
    KNOWLEDGE-BASED SYSTEMS, 2024, 286
  • [2] Universal Relocalizer for Weakly Supervised Referring Expression Grounding
    Zhang, Panpan
    Liu, Meng
    Song, Xuemeng
    Cao, Da
    Gao, Zan
    Nie, Liqiang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (07)
  • [3] APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension
    Lu, Yaxin
    Ji, Jiayi
    Chen, Xiaofu
    Zhang, Yuxin
    Ren, Tianhe
    Luo, Gen
    COMPUTER VISION - ECCV 2024, PT XIII, 2025, 15071 : 198 - 215
  • [4] Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding
    Liu, Xuejing
    Li, Liang
    Wang, Shuhui
    Zha, Zheng-Jun
    Meng, Dechao
    Huang, Qingming
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 2611 - 2620
  • [5] Weakly supervised video object segmentation initialized with referring expression
    Bu, Xiaoqing
    Sun, Yukuan
    Wang, Jianming
    Liu, Kunliang
    Liang, Jiayu
    Jin, Guanghao
    Chung, Tae-Sun
    NEUROCOMPUTING, 2021, 453 : 754 - 765
  • [6] RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension
    Sun, Jiamu
    Luo, Gen
    Zhou, Yiyi
    Sun, Xiaoshuai
    Jiang, Guannan
    Wang, Zhiyu
    Ji, Rongrong
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19144 - 19154
  • [7] SAFARI: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
    Nag, Sayan
    Goswami, Koustava
    Karanam, Srikrishna
    COMPUTER VISION-ECCV 2024, PT XLIV, 2025, 15102 : 485 - 503
  • [8] Progressive Semantic Reconstruction Network for Weakly Supervised Referring Expression Grounding
    Ji, Zhong
    Wu, Jiahe
    Wang, Yaodong
    Yang, Aiping
    Han, Jungong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12) : 13058 - 13070
  • [9] Fully and Weakly Supervised Referring Expression Segmentation With End-to-End Learning
    Li, Hui
    Sun, Mingjie
    Xiao, Jimin
    Lim, Eng Gee
    Zhao, Yao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5999 - 6012
  • [10] Weakly Supervised Referring Expression Grounding via Dynamic Self-Knowledge Distillation
    Mi, Jinpeng
    Chen, Zhiqian
    Zhang, Jianwei
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 1254 - 1260