Fuse and Calibrate: A Bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Citations: 0
Authors
Yan, Yichen [1 ,2 ]
He, Xingjian [1 ]
Chen, Sihan [2 ]
Lu, Shichen [3 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XI, ICIC 2024 | 2024, Vol. 14872
Funding
National Natural Science Foundation of China;
Keywords
Referring Image Segmentation; Vision-Language Models; Fusion & Calibration;
DOI
10.1007/978-981-97-5612-4_27
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, with the main challenge being the establishment of text-to-pixel correlation. Previous methods typically rely on single-modality features, such as vision or language features alone, to guide the multi-modal fusion process. However, this approach limits the interaction between vision and language, leading to a lack of fine-grained correlation between the language description and pixel-level details during the decoding process. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach in which both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they reflect the context of the input sentence. This bi-directional vision-language guided approach produces higher-quality multi-modal features for the decoder, facilitating adaptive propagation of fine-grained semantic information from textual features to visual features. Experiments on the RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show our approach outperforming state-of-the-art methods.
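The abstract's two-stage scheme — vision-guided fusion followed by language-guided calibration — can be sketched in minimal NumPy form. All function names, shapes, and the specific attention/gating choices below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vision_guided_fusion(vis, lang):
    """Stage 1 (assumed form): visual features act as queries over
    language tokens via cross-attention, yielding multi-modal features
    focused on key vision information."""
    attn = softmax(vis @ lang.T / np.sqrt(vis.shape[-1]))  # (P, T)
    return vis + attn @ lang                               # (P, D)

def language_guided_calibration(fused, lang):
    """Stage 2 (assumed form): a sentence-level embedding gates each
    channel of the fused features, calibrating them toward the context
    of the whole input sentence."""
    sent = lang.mean(axis=0)               # (D,) pooled sentence embedding
    gate = 1.0 / (1.0 + np.exp(-sent))     # sigmoid gate per channel
    return fused * gate                    # (P, D)

rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 8))   # 16 "pixel" features, dim 8
lang = rng.standard_normal((5, 8))   # 5 word-token features, dim 8

fused = vision_guided_fusion(vis, lang)
calibrated = language_guided_calibration(fused, lang)
print(calibrated.shape)
```

The point of the sketch is the ordering: vision guides the initial fusion, then language calibrates the result before decoding, so both modalities take a guiding role rather than one driving the entire fusion.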
Pages: 313-324
Page count: 12