ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

Cited by: 63
Authors
Zhou, Ziqin [1 ]
Lei, Yinjie [2 ]
Zhang, Bowen [1 ]
Liu, Lingqiao [1 ]
Liu, Yifan [1 ]
Affiliations
[1] Univ Adelaide, Adelaide, Australia
[2] Sichuan Univ, Chengdu, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01075
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level. Our investigation starts with a straightforward extension as our baseline, which generates semantic masks by comparing the similarity between text embeddings and patch embeddings extracted from CLIP. However, such a paradigm can heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple yet effective designs and find that they can largely retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization. Incorporating these modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves a speedup of about 5 times during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
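The baseline described in the abstract, assigning each CLIP patch token to the class whose text embedding it most resembles, can be sketched as follows. This is an illustrative sketch only, not the released ZegCLIP code: the function name, array shapes, and the use of NumPy in place of CLIP's actual encoders are all assumptions made for clarity.

```python
import numpy as np

def baseline_zero_shot_masks(patch_emb, text_emb, grid_hw):
    """Coarse zero-shot segmentation by text-patch similarity (illustrative).

    patch_emb: (N, D) patch embeddings from CLIP's image encoder
    text_emb:  (C, D) class-name embeddings from CLIP's text encoder
    grid_hw:   (H, W) patch grid layout, with H * W == N
    """
    # L2-normalize both sides so the dot product is cosine similarity,
    # the metric CLIP uses for image-text matching.
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = p @ t.T                   # (N, C) patch-to-class similarity
    labels = sim.argmax(axis=1)     # hard class assignment per patch
    return labels.reshape(grid_hw)  # (H, W) coarse semantic mask
```

In practice the resulting mask has patch-level resolution (e.g. 16x16 for a ViT-B/16 input) and would be upsampled to the image size; the paper's point is that training this naive extension overfits the seen classes, which motivates its three proposed designs.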
Pages
11175 - 11185
Number of pages
11