WintN-CSG: a weakly supervised semantic segmentation network based on basic multimodal large-scale pre-trained models

Cited by: 0

Authors:

Haotian Wen ^{[1]}

Derui Ding ^{[1]}

Wei Liang ^{[1]}

Ying Sun ^{[2]}

Affiliations:

[1] University of Shanghai for Science and Technology, School of Optical Electrical and Computer Engineering

[2] University of Shanghai for Science and Technology, Business School

Source:

Pattern Analysis and Applications | 2025 / Vol. 28 / Issue 2

Keywords:

Weakly supervised semantic segmentation; Attention selection class-aware attention-based affinity; Box mask denoising; SAM masks selection and fusion; SAM

DOI:

10.1007/s10044-025-01489-8

Chinese Library Classification (CLC) number:

Subject classification code:

Abstract:

Weakly supervised semantic segmentation (WSSS), which trains segmentation models from image-level labels only, has a much lower manual annotation cost than fully supervised semantic segmentation. However, the masks generated by currently popular methods based on the Class Activation Map (CAM) still suffer from low target segmentation accuracy, many noise pixels, and incorrectly activated target pixels. To address these shortcomings, the paper proposes a novel WSSS integration network (denoted WintN-CSG) that fuses the scalability and versatility of the multimodal pre-trained basic models CLIP, SAM, and Grounding-DINO. The network uses CLIP as its basic framework and adds three new modules: Attention Selection Class-aware Attention-based Affinity (AS-CAA), Box Mask Denoising (BMD), and SAM Mask Selection and Fusion (SMSF). Specifically, AS-CAA selects the representative attention weight maps in Multi-Head Self-Attention (MHSA) to preliminarily remove noise pixels and correct incorrectly activated pixels. Subsequently, BMD, combined with Grounding-DINO, shields all noise pixels outside the bounding box and refines the isolated pixels inside it, improving the integrity and accuracy of the mask. Furthermore, SMSF screens out the most suitable masks among the many candidates generated by SAM and recovers missing target pixels with the help of fusion and activation algorithms. Finally, experiments with only image-level labels on the PASCAL VOC 2012, MS COCO 2014 and CitySpace datasets show that the scheme achieves excellent performance in both mask generation efficiency and segmentation accuracy.
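As a rough illustration of the box-based shielding idea mentioned in the abstract, the following Python sketch zeroes activation values outside a bounding box and thresholds what remains. It is not the paper's BMD module: the function name `shield_outside_box`, the toy activation map, the box coordinates, and the 0.5 threshold are all assumptions made for illustration, and in the paper the box would come from Grounding-DINO rather than being hard-coded.

```python
import numpy as np


def shield_outside_box(activation, box):
    """Zero every activation value outside an (x_min, y_min, x_max, y_max) box.

    Hypothetical helper for illustration only; x_max and y_max are exclusive.
    """
    x_min, y_min, x_max, y_max = box
    shielded = np.zeros_like(activation)
    # Copy only the region inside the box; everything else stays zero.
    shielded[y_min:y_max, x_min:x_max] = activation[y_min:y_max, x_min:x_max]
    return shielded


# Toy 4x4 activation map with two stray activations outside the object region.
cam = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.7, 0.0],
    [0.0, 0.6, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.5],
])
box = (1, 1, 3, 3)  # assumed box, standing in for a detector output

# Threshold the shielded map to obtain a binary mask (0.5 is an assumed value).
mask = shield_outside_box(cam, box) > 0.5
```

In this toy case the strong activation at the top-left corner lies outside the box and is suppressed, while the four pixels inside the box survive the threshold.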
