Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

Citations: 0
Authors
Yang, Xi [1]
Duan, Songsong [1]
Wang, Nannan [1]
Gao, Xinbo [2]
Affiliations
[1] Xidian Univ, Xian, Peoples R China
[2] Chongqing Univ Posts & Telecommun, Chongqing, Peoples R China
Source
COMPUTER VISION - ECCV 2024, PT LXIX | 2025 / Vol. 15127
Funding
National Natural Science Foundation of China;
Keywords
Weakly Supervised Object Localization; Segment Anything Model; Global Token; Mask Prompt;
DOI
10.1007/978-3-031-72890-7_24
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies rely on the Class Activation Map (CAM) of CNNs and the self-attention map of transformers to identify object regions. However, neither CAM nor self-attention maps can learn pixel-level fine-grained information about foreground objects, which hinders further progress in WSOL. To address this problem, we leverage the zero-shot generalization and fine-grained segmentation capabilities of the Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity that arises in single-point-prompt-based SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt; GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Second, we feed grid points into SAM as dense prompts to maximize the probability of obtaining the foreground mask, which avoids the missed objects caused by a single point or box prompt. Finally, we propose a pixel-level similarity metric to match the mask prompt against the masks produced by SAM, where the mask with the highest score is taken as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03% and 66.85% Top-1 Loc, respectively.
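The last two steps of the abstract (dense grid-point prompts, then mask matching by a pixel-level score) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the similarity metric, so a soft IoU is assumed here, and the function names `grid_points`, `soft_iou`, and `select_mask` are hypothetical. The candidate masks are plain arrays standing in for whatever SAM would return for the grid prompts.

```python
import numpy as np


def grid_points(height, width, points_per_side):
    """Return evenly spaced (x, y) point prompts covering an image.

    Points sit at the centers of a points_per_side x points_per_side grid.
    """
    ys = (np.arange(points_per_side) + 0.5) * height / points_per_side
    xs = (np.arange(points_per_side) + 0.5) * width / points_per_side
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)


def soft_iou(prompt_map, mask):
    """Assumed pixel-level similarity: soft IoU between a coarse
    foreground map with values in [0, 1] and a binary candidate mask."""
    mask = mask.astype(np.float64)
    intersection = np.minimum(prompt_map, mask).sum()
    union = np.maximum(prompt_map, mask).sum()
    return float(intersection / union) if union > 0 else 0.0


def select_mask(prompt_map, candidate_masks):
    """Score every candidate against the mask prompt and return the
    index of the highest-scoring mask together with all scores."""
    scores = [soft_iou(prompt_map, m) for m in candidate_masks]
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    # Toy coarse foreground map: object assumed in the top-left corner.
    prompt_map = np.full((4, 4), 0.1)
    prompt_map[:2, :2] = 0.9

    # Toy candidates standing in for SAM masks from the grid prompts.
    top_left = np.zeros((4, 4), dtype=bool)
    top_left[:2, :2] = True
    bottom_right = np.zeros((4, 4), dtype=bool)
    bottom_right[2:, 2:] = True
    everything = np.ones((4, 4), dtype=bool)

    best, scores = select_mask(prompt_map, [top_left, bottom_right, everything])
    print(grid_points(4, 4, 2))
    print(best, scores)
```

In this toy setup the candidate that overlaps the high-activation region gets the top score, which mirrors the abstract's idea that the coarse map disambiguates among the many masks produced by dense prompting.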
Pages: 387-403
Page count: 17