DatUS: Data-Driven Unsupervised Semantic Segmentation With Pretrained Self-Supervised Vision Transformer

Cited by: 3
Authors
Kumar, Sonal [1 ]
Sur, Arijit [1 ]
Baruah, Rashmi Dutta [1 ]
Affiliations
[1] Indian Inst Technol, Gauhati 781039, India
Keywords
Training; Task analysis; Transformers; Semantic segmentation; Feature extraction; Visualization; Semantics; Computer vision; deep learning; representation learning; self-supervised learning; semantic segmentation; unsupervised learning; vision transformer (ViT)
DOI
10.1109/TCDS.2024.3383952
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
New self-supervised training schemes (STSs) continue to emerge, each taking a step closer to a universal foundation model. In this process, unsupervised downstream tasks serve as one way to validate the quality of the visual features learned with self-supervised training. However, unsupervised dense semantic segmentation has yet to be explored as a downstream task that can exploit and evaluate the semantic information embedded in the patch-level feature representations learned during self-supervised training of vision transformers. We therefore propose DatUS, a novel data-driven framework that performs unsupervised dense semantic segmentation (DSS) as a downstream task. DatUS generates semantically consistent pseudosegmentation masks for an unlabeled image dataset without using any visual prior or synchronized data. Experiments show that the proposed framework achieves the highest MIoU (24.90) and average F1 score (36.3) with DINOv2 as the STS, and the highest pixel accuracy (62.18) with DINO, on the training set of the SUIM dataset. It also outperforms state-of-the-art methods for the unsupervised DSS task with 15.02% MIoU, 21.47% pixel accuracy, and 16.06% average F1 score on the validation set of the SUIM dataset, and achieves a competitive level of accuracy on the large-scale COCO dataset.
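The abstract reports MIoU, pixel accuracy, and average F1 for the generated pseudosegmentation masks. As an illustrative sketch (not the paper's own code; function and variable names here are hypothetical), these segmentation metrics can be computed from a confusion matrix over predicted and ground-truth label maps:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Pixel accuracy and mean IoU from two integer label maps.
    Illustrative only; DatUS's evaluation code may differ."""
    pred = pred.ravel()
    gt = gt.ravel()
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(gt * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    pixel_acc = np.diag(cm).sum() / cm.sum()
    # Per-class IoU = TP / (TP + FP + FN); skip classes absent from both maps.
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    iou = tp[union > 0] / union[union > 0]
    return pixel_acc, iou.mean()

pred = np.array([[0, 0, 1], [1, 1, 2]])  # toy predicted mask
gt   = np.array([[0, 1, 1], [1, 1, 2]])  # toy ground-truth mask
acc, miou = segmentation_metrics(pred, gt, num_classes=3)
# acc = 5/6, miou = 0.75 for this toy pair
```

For unsupervised methods, predicted cluster labels are typically matched to ground-truth classes (e.g., via Hungarian matching) before these metrics are computed.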
Pages: 1775-1788
Page count: 14
References
47 in total
[1]  
An X, 2023, Arxiv, DOI arXiv:2304.05884
[2]  
Arthur D, 2007, PROCEEDINGS OF THE EIGHTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P1027
[3]   DETReg: Unsupervised Pretraining with Region Priors for Object Detection [J].
Bar, Amir ;
Wang, Xin ;
Kantorov, Vadim ;
Reed, Colorado J. ;
Herzig, Roei ;
Chechik, Gal ;
Rohrbach, Anna ;
Darrell, Trevor ;
Globerson, Amir .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :14585-14595
[4]   The Fast Bilateral Solver [J].
Barron, Jonathan T. ;
Poole, Ben .
COMPUTER VISION - ECCV 2016, PT III, 2016, 9907 :617-632
[5]   Fast unfolding of communities in large networks [J].
Blondel, Vincent D. ;
Guillaume, Jean-Loup ;
Lambiotte, Renaud ;
Lefebvre, Etienne .
JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2008,
[6]   COCO-Stuff: Thing and Stuff Classes in Context [J].
Caesar, Holger ;
Uijlings, Jasper ;
Ferrari, Vittorio .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :1209-1218
[7]  
Caron M, 2020, ADV NEUR IN, V33
[8]   Emerging Properties in Self-Supervised Vision Transformers [J].
Caron, Mathilde ;
Touvron, Hugo ;
Misra, Ishan ;
Jegou, Herve ;
Mairal, Julien ;
Bojanowski, Piotr ;
Joulin, Armand .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :9630-9640
[9]  
Chen LC, 2017, Arxiv, DOI arXiv:1706.05587
[10]  
Chen T, 2020, PR MACH LEARN RES, V119