RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

Cited by: 8
Authors
Jin, Lei [1, 2]
Luo, Gen [1]
Zhou, Yiyi [1, 2]
Sun, Xiaoshuai [1, 2]
Jiang, Guannan [3]
Shu, Annan [3]
Ji, Rongrong [1, 2]
Affiliations
[1] Xiamen Univ, Minist Educ China, Key Lab Multimedia Trusted Percept & Efficient Co, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen 361005, Peoples R China
[3] Contemporary Amperex Technol Co Ltd CATE, Intelligent Mfg Dept, Ningde, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
DOI
10.1109/CVPR52729.2023.00263
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Referring Expression Comprehension (REC) is the task of grounding the referent described by an expression, and its development is greatly limited by expensive instance-level annotations. Most existing weakly supervised methods are built on two-stage detection networks, which are computationally expensive. In this paper, we turn to an efficient one-stage detector and propose a novel weakly supervised model called RefCLIP. Specifically, RefCLIP redefines weakly supervised REC as an anchor-text matching problem, which avoids the complex post-processing of existing methods. To achieve weakly supervised learning, we introduce an anchor-based contrastive loss that optimizes RefCLIP via numerous anchor-text pairs. Based on RefCLIP, we further propose the first model-agnostic weakly supervised training scheme for existing REC models, where RefCLIP acts as a mature teacher that generates pseudo-labels for teaching common REC models. With our careful designs, this scheme can even help existing REC models, e.g., TransVG and SimREC, achieve better weakly supervised performance than RefCLIP itself. To validate our approaches, we conduct extensive experiments on four REC benchmarks, i.e., RefCOCO, RefCOCO+, RefCOCOg and ReferItGame. Experimental results show significant performance gains over existing weakly supervised models, e.g., +24.87% on RefCOCO, as well as 5x faster inference. Project: https://refclip.github.io.
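The abstract does not give the form of the anchor-based contrastive loss, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes an InfoNCE-style objective in which, for each expression, the best-matching anchor in its own image is the positive and all anchors from the other images in the batch are negatives. The function name `anchor_text_contrastive_loss`, the tensor shapes, and the `temperature` value are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def anchor_text_contrastive_loss(anchors, texts, temperature=0.1):
    """Hypothetical InfoNCE-style anchor-text loss (an assumption, not the paper's code).

    anchors: (B, N, D) array, N anchor features for each of B images.
    texts:   (B, D) array, one expression feature per image.
    Returns (mean loss, index of the best-matching anchor per image).
    """
    anchors = l2_normalize(anchors)
    texts = l2_normalize(texts)
    batch = texts.shape[0]

    # sim[i, j, n]: scaled cosine similarity of expression i and anchor n of image j.
    sim = np.einsum("id,jnd->ijn", texts, anchors) / temperature

    idx = np.arange(batch)
    own = sim[idx, idx]              # (B, N): each expression vs. its own image's anchors
    best = own.argmax(axis=1)        # anchor-text matching: pick the top anchor
    pos = own[idx, best]             # (B,) positive logits

    other_images = ~np.eye(batch, dtype=bool)
    losses = []
    for i in range(batch):
        # Negatives: every anchor that belongs to a different image in the batch.
        neg = sim[i][other_images[i]].ravel()
        logits = np.concatenate(([pos[i]], neg))
        # Numerically stable log-sum-exp for the softmax denominator.
        m = logits.max()
        log_denominator = m + np.log(np.exp(logits - m).sum())
        losses.append(log_denominator - pos[i])
    return float(np.mean(losses)), best
```

Under this reading, the `best` indices returned at inference time would directly select the predicted anchor for each expression, which is the sense in which matching could replace heavier post-processing; how the selected anchor is decoded into a box is not covered by the abstract and is left out here.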
Pages: 2681-2690
Number of pages: 10