ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

Cited by: 63
Authors
Zhou, Ziqin [1 ]
Lei, Yinjie [2 ]
Zhang, Bowen [1 ]
Liu, Lingqiao [1 ]
Liu, Yifan [1 ]
Affiliations
[1] Univ Adelaide, Adelaide, Australia
[2] Sichuan Univ, Chengdu, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01075
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level. Our investigation starts with a straightforward extension as our baseline, which generates semantic masks by comparing the similarity between text embeddings and patch embeddings extracted from CLIP. However, such a paradigm can heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple yet effective designs and find that they can largely retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization. Incorporating these modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves a speedup of about 5 times during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
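The baseline described in the abstract, assigning each CLIP patch token to the class whose text embedding it most resembles, can be sketched as follows. This is an illustrative sketch only, not the released ZegCLIP code: the function name, array shapes, and the use of NumPy in place of CLIP's actual encoders are all assumptions made for clarity.

```python
import numpy as np

def baseline_zero_shot_masks(patch_emb, text_emb, grid_hw):
    """Coarse zero-shot segmentation by text-patch similarity (illustrative).

    patch_emb: (N, D) patch embeddings from CLIP's image encoder
    text_emb:  (C, D) class-name embeddings from CLIP's text encoder
    grid_hw:   (H, W) patch grid layout, with H * W == N
    """
    # L2-normalize both sides so the dot product is cosine similarity,
    # the metric CLIP uses for image-text matching.
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = p @ t.T                   # (N, C) patch-to-class similarity
    labels = sim.argmax(axis=1)     # hard class assignment per patch
    return labels.reshape(grid_hw)  # (H, W) coarse semantic mask
```

In practice the resulting mask has patch-level resolution (e.g. 16x16 for a ViT-B/16 input) and would be upsampled to the image size; the paper's point is that training this naive extension overfits the seen classes, which motivates its three proposed designs.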
Pages
11175 - 11185
Number of pages
11