CRIS: CLIP-Driven Referring Image Segmentation

Cited by: 163
Authors
Wang, Zhaoqing [1 ,2 ]
Lu, Yu [3 ]
Li, Qiang [4 ]
Tao, Xunqiang [2 ]
Guo, Yandong [2 ]
Gong, Mingming [5 ]
Liu, Tongliang [1 ]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
[2] OPPO Res Inst, Dongguan, Peoples R China
[3] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[4] Kuaishou Technol, Beijing, Peoples R China
[5] Univ Melbourne, Parkville, Vic, Australia
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
Keywords
DOI
10.1109/CVPR52688.2022.01139
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and images, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet they transfer language and vision knowledge from these models separately, ignoring the multi-modal correspondence information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to irrelevant ones. Experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
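The text-to-pixel contrastive objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single L2-normalized text feature, a set of pixel-level features, and a binary referent mask, and it pulls the text feature toward masked pixels while pushing it away from the rest via a per-pixel sigmoid cross-entropy on text-pixel similarities (the function name and exact normalization are assumptions for this sketch).

```python
import numpy as np

def text_to_pixel_contrastive_loss(text_feat, pixel_feats, mask, eps=1e-8):
    """Hedged sketch of a text-to-pixel contrastive loss.

    text_feat:   (D,) text representation
    pixel_feats: (N, D) pixel-level features (flattened spatial grid)
    mask:        (N,) binary ground-truth referent mask (1 = referent pixel)
    """
    # Cosine-normalize both sides so the similarity is bounded in [-1, 1].
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + eps)
    sim = p @ t                        # (N,) text-pixel similarity scores
    prob = 1.0 / (1.0 + np.exp(-sim))  # sigmoid -> per-pixel "match" probability
    # Binary cross-entropy: referent pixels should match the text feature,
    # all other pixels should not.
    loss = -(mask * np.log(prob + eps) + (1 - mask) * np.log(1 - prob + eps))
    return loss.mean()
```

Minimizing this loss makes text-aligned pixels score high and background pixels score low, so the raw similarity map itself can be thresholded into a segmentation at inference time.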
Pages: 11676-11685
Page count: 10