CRIS: CLIP-Driven Referring Image Segmentation

Cited by: 163
Authors
Wang, Zhaoqing [1 ,2 ]
Lu, Yu [3 ]
Li, Qiang [4 ]
Tao, Xunqiang [2 ]
Guo, Yandong [2 ]
Gong, Mingming [5 ]
Liu, Tongliang [1 ]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
[2] OPPO Res Inst, Dongguan, Peoples R China
[3] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[4] Kuaishou Technol, Beijing, Peoples R China
[5] Univ Melbourne, Parkville, Vic, Australia
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
Keywords
DOI
10.1109/CVPR52688.2022.01139
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and images, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet they transfer language and vision knowledge from these models separately, ignoring the multi-modal correspondence information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to irrelevant ones. Experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
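The text-to-pixel contrastive objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single L2-normalized text feature, a set of pixel-level features, and a binary referent mask, and it pulls the text feature toward masked pixels while pushing it away from the rest via a per-pixel sigmoid cross-entropy on text-pixel similarities (the function name and exact normalization are assumptions for this sketch).

```python
import numpy as np

def text_to_pixel_contrastive_loss(text_feat, pixel_feats, mask, eps=1e-8):
    """Hedged sketch of a text-to-pixel contrastive loss.

    text_feat:   (D,) text representation
    pixel_feats: (N, D) pixel-level features (flattened spatial grid)
    mask:        (N,) binary ground-truth referent mask (1 = referent pixel)
    """
    # Cosine-normalize both sides so the similarity is bounded in [-1, 1].
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + eps)
    sim = p @ t                        # (N,) text-pixel similarity scores
    prob = 1.0 / (1.0 + np.exp(-sim))  # sigmoid -> per-pixel "match" probability
    # Binary cross-entropy: referent pixels should match the text feature,
    # all other pixels should not.
    loss = -(mask * np.log(prob + eps) + (1 - mask) * np.log(1 - prob + eps))
    return loss.mean()
```

Minimizing this loss makes text-aligned pixels score high and background pixels score low, so the raw similarity map itself can be thresholded into a segmentation at inference time.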
Pages: 11676-11685
Page count: 10