SegLD: Achieving universal, zero-shot and open-vocabulary segmentation through multimodal fusion via latent diffusion processes

Cited by: 1
Authors
Zheng, Hongtao [1]
Ding, Yifei [1]
Wang, Zilong [1]
Huang, Xinyan [1,2]
Affiliations
[1] The Hong Kong Polytechnic University, Department of Building Environment and Energy Engineering, Hong Kong, People's Republic of China
[2] The Hong Kong Polytechnic University, Shenzhen Research Institute, Shenzhen, People's Republic of China
Keywords
Open-vocabulary; Universal; Latent diffusion process; Multimodal fusion; Contrastive loss; COMPUTER VISION; DATASET
DOI
10.1016/j.inffus.2024.102509
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Open-vocabulary learning can identify categories annotated during training (seen categories) and generalize to categories not annotated in the training set (unseen categories), which in principle extends segmentation systems to more universal applications. However, current open-vocabulary segmentation frameworks are tailored to specific tasks or must be retrained for each new task, and they significantly underperform fully supervised frameworks when inferring seen categories. We therefore introduce a universal open-vocabulary segmentation framework based on the latent diffusion process (SegLD), which requires only a single training session on a panoptic dataset to perform inference across all open-vocabulary segmentation tasks, and reaches state-of-the-art (SOTA) segmentation performance for both seen and unseen categories in every task. SegLD comprises two stages. In the first stage, two parallel latent diffusion processes deeply fuse the text (image caption or category labels) with the image information, and the multi-scale features output by the two processes are aggregated scale by scale. In the second stage, we introduce text queries, text list queries, and task queries, and compute contrastive losses between them so that the model learns inter-category and inter-task differences. The text queries are then fed into a Transformer decoder to obtain category-agnostic segmentation masks, and a classification loss, defined according to the type of text input used during training (image captions or category labels), assigns a category label from the open vocabulary to each predicted binary mask. Experimental results show that, with just a single training session, SegLD significantly outperforms contemporary SOTA fully supervised and open-vocabulary segmentation frameworks on almost all evaluation metrics, for both seen and unseen categories, on the ADE20K, Cityscapes, and COCO datasets. This highlights SegLD's potential as a universal segmentation framework that can replace task-specific frameworks and adapt to diverse segmentation domains. The project page is at https://zht-segld.github.io/.
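To make the second-stage objective concrete, below is a minimal PyTorch sketch of a query-level contrastive loss in the spirit described above. It is an assumption-laden illustration, not the authors' implementation: the function name, the pooled (N, D) query shapes, and the temperature value are all hypothetical, and SegLD's actual query design and loss terms are defined in its released code.

```python
# Hypothetical sketch of a query-level contrastive loss in the spirit of
# SegLD's second stage (names, shapes, and temperature are assumptions).
import torch
import torch.nn.functional as F

def query_contrastive_loss(text_queries: torch.Tensor,
                           task_queries: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling matched query pairs together.

    text_queries, task_queries: (N, D) pooled query embeddings, where the
    i-th rows of the two tensors are assumed to describe the same sample.
    """
    # Normalize so the dot product becomes a cosine similarity.
    t = F.normalize(text_queries, dim=-1)
    k = F.normalize(task_queries, dim=-1)
    logits = t @ k.T / temperature  # (N, N) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    # Symmetric cross-entropy over rows and columns: matched pairs sit on
    # the diagonal and are treated as the positive class in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

The symmetric cross-entropy over rows and columns mirrors CLIP-style contrastive training, a common choice when aligning two paired sets of embeddings.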
Pages: 19