RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

Cited: 8
|
Authors
Jin, Lei [1 ,2 ]
Luo, Gen [1 ]
Zhou, Yiyi [1 ,2 ]
Sun, Xiaoshuai [1 ,2 ]
Jiang, Guannan [3 ]
Shu, Annan [3 ]
Ji, Rongrong [1 ,2 ]
Affiliations
[1] Xiamen Univ, Minist Educ China, Key Lab Multimedia Trusted Percept & Efficient Co, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen 361005, Peoples R China
[3] Contemporary Amperex Technol Co Ltd CATL, Intelligent Mfg Dept, Ningde, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
DOI
10.1109/CVPR52729.2023.00263
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring Expression Comprehension (REC) is the task of grounding a referent based on a natural-language expression, and its development is greatly limited by expensive instance-level annotations. Most existing weakly supervised methods are built on two-stage detection networks, which are computationally expensive. In this paper, we resort to an efficient one-stage detector and propose a novel weakly supervised model called RefCLIP. Specifically, RefCLIP redefines weakly supervised REC as an anchor-text matching problem, which avoids the complex post-processing of existing methods. To achieve weakly supervised learning, we introduce an anchor-based contrastive loss to optimize RefCLIP over numerous anchor-text pairs. Based on RefCLIP, we further propose the first model-agnostic weakly supervised training scheme for existing REC models, where RefCLIP acts as a mature teacher that generates pseudo-labels for teaching common REC models. With our careful designs, this scheme can even help existing REC models, e.g., TransVG and SimREC, achieve better weakly supervised performance than RefCLIP itself. To validate our approaches, we conduct extensive experiments on four REC benchmarks, i.e., RefCOCO, RefCOCO+, RefCOCOg and ReferItGame. Experimental results not only show significant performance gains over existing weakly supervised models, e.g., +24.87% on RefCOCO, but also a 5x faster inference speed. Project: https://refclip.github.io.
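The anchor-based contrastive loss described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy version, not the paper's implementation: it supposes each image contributes K anchor features, scores an image-text pair by the best-matching anchor (max cosine similarity), and applies an InfoNCE-style loss across the batch so that each expression retrieves its own image. The function name, the max-pooling choice, and the temperature value are all illustrative.

```python
import numpy as np

def anchor_text_contrastive_loss(anchor_feats, text_feats, temperature=0.07):
    """Toy anchor-based contrastive (InfoNCE-style) loss.

    anchor_feats: (B, K, D) -- K candidate anchor features per image
    text_feats:   (B, D)    -- one expression embedding per image
    The image-text score is the max cosine similarity over that image's
    anchors; text i is treated as the positive for image i.
    """
    # L2-normalize so dot products are cosine similarities
    a = anchor_feats / np.linalg.norm(anchor_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    # sims[b, c, k]: similarity between image b's anchor k and text c
    sims = np.einsum('bkd,cd->bck', a, t)
    # image-text score via best-matching anchor, scaled by temperature
    scores = sims.max(axis=-1) / temperature  # (B, B): rows=images, cols=texts
    # text-to-image InfoNCE: each text should pick out its own image
    logits = scores.T                         # (B, B): rows=texts
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because supervision comes only from which expression belongs to which image, no box annotations enter the loss; the anchor that best explains the expression is discovered implicitly, which is the intuition behind pseudo-labeling downstream REC models.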
Pages: 2681-2690
Page count: 10
Related Papers
50 items in total
  • [21] Correspondence Matters for Video Referring Expression Comprehension
    Cao, Meng
    Jiang, Ji
    Chen, Long
    Zou, Yuexian
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4967 - 4976
  • [22] Inexactly Matched Referring Expression Comprehension With Rationale
    Li, Xiaochuan
    Fan, Baoyu
    Zhang, Runze
    Zhao, Kun
    Guo, Zhenhua
    Zhao, Yaqian
    Li, Rengang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3937 - 3950
  • [23] Referring Expression Comprehension: A Survey of Methods and Datasets
    Qiao, Yanyuan
    Deng, Chaorui
    Wu, Qi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 4426 - 4440
  • [24] Relationship Aggregation Network for Referring Expression Comprehension
    Guo W.
    Zhang Y.
    Liu S.
    Yang J.
    Yuan X.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2023, 60 (11): : 2611 - 2623
  • [25] Towards Further Comprehension on Referring Expression with Rationale
    Li, Rengang
    Fan, Baoyu
    Li, Xiaochuan
    Zhang, Runze
    Guo, Zhenhua
    Zhao, Kun
    Zhao, Yaqian
    Gong, Weifeng
    Wang, Endong
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4336 - 4344
  • [26] Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation
    Dai, Qiyuan
    Yang, Sibei
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13711 - 13722
  • [27] Multi-level attention for referring expression comprehension
    Sun, Yanfeng
    Zhang, Yunru
    Jiang, Huajie
    Hu, Yongli
    Yin, Baocai
    PATTERN RECOGNITION LETTERS, 2023, 172 : 252 - 258
  • [28] LUNA: Language as Continuing Anchors for Referring Expression Comprehension
    Liang, Yaoyuan
    Yang, Zhao
    Tang, Yansong
    Fan, Jiashuo
    Li, Ziran
    Wang, Jingang
    Torr, Philip H. S.
    Huang, Shao-Lun
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5174 - 5184
  • [29] MAttNet: Modular Attention Network for Referring Expression Comprehension
    Yu, Licheng
    Lin, Zhe
    Shen, Xiaohui
    Yang, Jimei
    Lu, Xin
    Bansal, Mohit
    Berg, Tamara L.
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 1307 - 1315
  • [30] Referring Expression Comprehension Using Language Adaptive Inference
    Su, Wei
    Miao, Peihan
    Dou, Huanzhang
    Fu, Yongjian
    Li, Xi
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 2, 2023, : 2357 - 2365