SAN: Side Adapter Network for Open-Vocabulary Semantic Segmentation

Cited by: 10
Authors
Xu, Mengde [1 ]
Zhang, Zheng [1 ]
Wei, Fangyun [2 ]
Hu, Han [2 ]
Bai, Xiang [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Hongshan 430074, Peoples R China
[2] Microsoft Res Asia, Beijing 100080, Peoples R China
Keywords
Adaptation models; Semantic segmentation; Predictive models; Proposals; Task analysis; Generators; Benchmark testing; Large-scale vision-language model; open-vocabulary semantic segmentation
DOI
10.1109/TPAMI.2023.3311618
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
This article concentrates on open-vocabulary semantic segmentation, where a well-optimized model is able to segment arbitrary categories that appear in an image. To achieve this goal, we present a novel framework termed Side Adapter Network, or SAN for short. Our design principles are three-fold: 1) Recent large-scale vision-language models (e.g., CLIP) show promising open-vocabulary image classification capability, so adapting a pre-trained CLIP model to open-vocabulary semantic segmentation is economical in training cost. 2) Our SAN model should be both lightweight and effective in order to reduce the inference cost. To achieve this, we fuse the CLIP model's intermediate features to enhance the representation capability of the SAN model, and drive the CLIP model to focus on the informative areas of an image with the aid of the attention biases predicted by the side adapter network. 3) Our approach should empower mainstream segmentation architectures with the capability of open-vocabulary segmentation. We present P-SAN and R-SAN to support the widely adopted pixel-wise and region-wise segmentation paradigms, respectively. Experimentally, our approach achieves state-of-the-art performance on five commonly used benchmarks while having far fewer trainable parameters and GFLOPs. For instance, our R-SAN outperforms the previous best method, OvSeg, by +2.3 mIoU averaged across all benchmarks while using only 6% of the trainable parameters and less than 1% of the GFLOPs. In addition, we conduct a comprehensive analysis of the open-vocabulary semantic segmentation datasets and verify the feasibility of transferring a well-optimized R-SAN model to the video segmentation task.
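The abstract describes the mechanism only at a high level: a lightweight side branch fuses intermediate features from a frozen CLIP visual encoder and predicts mask proposals together with attention biases that steer CLIP toward informative regions. The following minimal PyTorch sketch illustrates that general idea. All module names (e.g., SideAdapterSketch), dimensions, the fusion point, and the head designs are illustrative assumptions, not the authors' exact architecture or released code.

```python
import torch
import torch.nn as nn


class SideAdapterSketch(nn.Module):
    """Lightweight side branch: fuses frozen CLIP intermediate features and
    predicts mask proposals plus per-head attention biases (illustrative only)."""

    def __init__(self, clip_dim=768, adapter_dim=240, num_queries=100,
                 clip_heads=12, depth=2):
        super().__init__()
        self.num_queries = num_queries
        self.clip_heads = clip_heads
        # Learnable query tokens, one per mask proposal.
        self.queries = nn.Parameter(torch.randn(num_queries, adapter_dim))
        # Shallow patch embedding for the adapter's own visual tokens.
        self.patch_embed = nn.Conv2d(3, adapter_dim, kernel_size=16, stride=16)
        # Project frozen CLIP intermediate features into the adapter width.
        self.clip_proj = nn.Linear(clip_dim, adapter_dim)
        layer = nn.TransformerEncoderLayer(adapter_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Heads: per-query mask embeddings and per-query, per-CLIP-head bias embeddings.
        self.mask_head = nn.Linear(adapter_dim, adapter_dim)
        self.bias_head = nn.Linear(adapter_dim, clip_heads * adapter_dim)

    def forward(self, image, clip_feats):
        # image: (B, 3, H, W); clip_feats: (B, N, clip_dim) patch tokens taken
        # from an intermediate layer of a frozen CLIP visual encoder, assumed
        # to share the adapter's 16x16 token grid.
        b = image.shape[0]
        x = self.patch_embed(image).flatten(2).transpose(1, 2)      # (B, N, C)
        x = x + self.clip_proj(clip_feats)                          # feature fusion
        tokens = torch.cat([self.queries.expand(b, -1, -1), x], dim=1)
        tokens = self.blocks(tokens)
        q, v = tokens[:, :self.num_queries], tokens[:, self.num_queries:]
        # Mask proposals: inner product of query embeddings with visual tokens.
        mask_logits = torch.einsum('bqc,bnc->bqn', self.mask_head(q), v)
        # Attention biases: one map per query and per CLIP attention head,
        # intended to be added to CLIP's attention logits.
        bias_embed = self.bias_head(q).reshape(b, self.num_queries, self.clip_heads, -1)
        attn_bias = torch.einsum('bqhc,bnc->bqhn', bias_embed, v)
        return mask_logits, attn_bias


model = SideAdapterSketch()
image = torch.randn(2, 3, 224, 224)
clip_feats = torch.randn(2, 196, 768)   # 14x14 patch tokens at ViT-B/16 width
masks, biases = model(image, clip_feats)
print(masks.shape, biases.shape)        # (2, 100, 196) and (2, 100, 12, 196)
```

The key design point reflected here is that only the small side branch is trainable; the CLIP features enter as frozen inputs, which is what keeps the trainable parameter count and GFLOPs low relative to fine-tuning the full backbone.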
Pages: 15546-15561
Page count: 16