CRIS: CLIP-Driven Referring Image Segmentation

Cited by: 208
Authors
Wang, Zhaoqing [1 ,2 ]
Lu, Yu [3 ]
Li, Qiang [4 ]
Tao, Xunqiang [2 ]
Guo, Yandong [2 ]
Gong, Mingming [5 ]
Liu, Tongliang [1 ]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
[2] OPPO Res Inst, Dongguan, Peoples R China
[3] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[4] Kuaishou Technol, Beijing, Peoples R China
[5] Univ Melbourne, Parkville, Vic, Australia
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
DOI
10.1109/CVPR52688.2022.01139
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and image, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet transfer the language and vision knowledge from those models separately, ignoring the multi-modal correspondence information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to the irrelevant ones. The experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
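The text-to-pixel contrastive learning described in the abstract can be sketched as a per-pixel alignment objective: the similarity between the sentence feature and each pixel feature is pushed up for referent pixels and down for the rest. The function below is a minimal illustrative sketch, not the paper's implementation; the shapes, the dot-product similarity, and the binary cross-entropy form are assumptions for illustration.

```python
import numpy as np

def text_to_pixel_contrastive_loss(text_feat, pixel_feats, mask):
    """Hedged sketch of a text-to-pixel contrastive objective.

    text_feat:   (D,)   projected sentence-level text feature (assumed shape)
    pixel_feats: (N, D) projected per-pixel visual features (assumed shape)
    mask:        (N,)   binary ground truth, 1 = pixel belongs to the referent
    """
    logits = pixel_feats @ text_feat            # (N,) text-pixel similarities
    probs = 1.0 / (1.0 + np.exp(-logits))       # sigmoid over similarities
    eps = 1e-8
    # Binary cross-entropy: pull referent pixels toward the text feature,
    # push irrelevant pixels away from it.
    loss = -(mask * np.log(probs + eps)
             + (1.0 - mask) * np.log(1.0 - probs + eps))
    return loss.mean()

# Toy usage with random features and a random referent mask.
rng = np.random.default_rng(0)
D, N = 8, 16
t = rng.standard_normal(D)
P = rng.standard_normal((N, D))
m = rng.integers(0, 2, size=N).astype(float)
print(text_to_pixel_contrastive_loss(t, P, m))
```

Minimizing this loss makes the text feature act as a classifier weight over pixels, which is one way to realize the text-to-pixel alignment the framework targets.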
Pages: 11676 - 11685 (10 pages)