ReferSAM: Unleashing Segment Anything Model for Referring Image Segmentation

Times Cited: 0
Authors
Liu, Sun-Ao [1 ]
Xie, Hongtao [1 ]
Ge, Jiannan [1 ]
Zhang, Yongdong [1 ]
Affiliations
[1] University of Science and Technology of China, School of Information Science and Technology, Hefei 230022, China
Keywords
Visualization; Image segmentation; Decoding; Linguistics; Feature extraction; Computational modeling; Vectors; Transformers; Image coding; Circuits and systems; Referring image segmentation; segment anything; vision-language interactor; prompt learning
DOI
10.1109/TCSVT.2024.3524543
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline Classification Code
0808; 0809
Abstract
The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM's segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. Consequently, the generated embeddings sufficiently prompt SAM's mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at https://github.com/lsa1997/ReferSAM.
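The record contains no pseudocode, but the abstract names two components whose roles can be illustrated with a toy sketch: fusing linguistic (word) features with visual (patch) features during encoding (the role of the Vision-Language Interactor) and aggregating the fused features into sparse and dense prompt embeddings for SAM's mask decoder (the role of the Vision-Language Prompter). The NumPy snippet below is a hedged illustration only: every function name, the single-head attention without learned projections, and the mean/max pooling for sparse prompts are assumptions for exposition, not the paper's actual design (see the released code at the linked repository for that).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(visual, linguistic):
    # visual: (N, d) patch features; linguistic: (T, d) word features.
    # Each patch attends over all words (toy single-head attention,
    # no learned projections -- an illustrative simplification).
    d = visual.shape[-1]
    attn = softmax(visual @ linguistic.T / np.sqrt(d), axis=-1)  # (N, T)
    return visual + attn @ linguistic  # residual fusion, shape (N, d)

def make_prompts(fused, linguistic):
    # Sparse prompts: pool word features into a few embeddings
    # (mean + max pooling here, a hypothetical choice).
    sparse = np.stack([linguistic.mean(axis=0), linguistic.max(axis=0)])
    # Dense prompt: the fused per-patch features (would be reshaped
    # to a spatial map before feeding SAM's mask decoder).
    dense = fused
    return sparse, dense

rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 8))     # 16 patches, dim 8
linguistic = rng.standard_normal((5, 8))  # 5 words, dim 8
fused = cross_attend(visual, linguistic)
sparse, dense = make_prompts(fused, linguistic)
print(sparse.shape, dense.shape)  # (2, 8) (16, 8)
```

The point of the sketch is only the data flow the abstract describes: language conditions the visual features first, and the prompts handed to the mask decoder are derived from the aligned features rather than from raw text embeddings alone.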
Pages: 4910-4922
Page Count: 13