Purify Then Guide: A Bi-Directional Bridge Network for Open-Vocabulary Semantic Segmentation

Citations: 0
Authors
Pan, Yuwen [1]
Sun, Rui [2]
Wang, Yuan
Yang, Wenfei [2, 3]
Zhang, Tianzhu [2, 3]
Zhang, Yongdong [2, 4]
Affiliations
[1] Univ Sci & Technol China, Sch Cyber Sci & Technol, Hefei 230027, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci, Hefei 230027, Peoples R China
[3] Deep Space Explorat Lab, Hefei 230027, Peoples R China
[4] Peoples Daily Online, State Key Lab Commun Content Cognit, Beijing 100733, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Semantics; Vocabulary; Semantic segmentation; Reliability; Visualization; Proposals; Modulation; Open-vocabulary semantic segmentation; semantic purification; bi-directional guidance; reliable attention;
DOI
10.1109/TCSVT.2024.3464631
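Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract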
Open-vocabulary semantic segmentation (OVSS) aims to segment an image into regions of corresponding semantic vocabularies, without being limited to a predefined set of object categories. Existing works mainly utilize large-scale vision-language models (e.g., CLIP) to leverage their superior open-vocabulary classification abilities in a two-stage manner. However, their heavy reliance on the first-stage segmentation network leaves the full potential of CLIP untapped, creating an unresolved gap between the rich pre-training knowledge and the challenging per-pixel classification task. Although the recent one-stage paradigm has further leveraged pre-trained vision knowledge from CLIP, it fails to effectively utilize text information due to the inclusion of numerous unrelated semantics in the vocabulary list. How to avoid noise interference in text information and utilize language guidance remains a Gordian knot. In this paper, we propose a bi-directional bridge network (BBN) to bridge the gap between upstream pre-trained models and downstream segmentation tasks. It first purifies the noisy text embedding and then guides semantics-vision aggregation with the purified information in a purification-then-guidance manner, thereby facilitating effective semantic utilization. Specifically, we design an optimal purification modulator to purify noisy text information via the optimal transport algorithm, and a reliable guidance modulator to integrate proper textual information into vision embedding via the designed reliable attention in an adaptive manner. Extensive experimental results on five challenging benchmarks demonstrate that our BBN performs favorably against state-of-the-art open-vocabulary semantic segmentation methods.
Pages: 343-356
Number of pages: 14