Fuse and Calibrate: A Bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Citations: 0
Authors
Yan, Yichen [1 ,2 ]
He, Xingjian [1 ]
Chen, Sihan [2 ]
Lu, Shichen [3 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XI, ICIC 2024 | 2024, Vol. 14872
Funding
National Natural Science Foundation of China;
Keywords
Referring Image Segmentation; Vision-Language Models; Fusion & Calibration;
DOI
10.1007/978-981-97-5612-4_27
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, with the main challenge being the establishment of text-to-pixel correlation. Previous methods typically rely on single-modality features, such as vision or language features alone, to guide the multi-modal fusion process. However, this approach limits the interaction between vision and language, leading to a lack of fine-grained correlation between the language description and pixel-level details during the decoding process. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach in which both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they reflect the context of the input sentence. This bi-directional vision-language guided approach produces higher-quality multi-modal features for the decoder, facilitating adaptive propagation of fine-grained semantic information from textual features to visual features. Experiments on the RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show our approach outperforming state-of-the-art methods.
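The abstract's two-stage scheme — vision-guided fusion followed by language-guided calibration — can be sketched in minimal NumPy form. All function names, shapes, and the specific attention/gating choices below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vision_guided_fusion(vis, lang):
    """Stage 1 (assumed form): visual features act as queries over
    language tokens via cross-attention, yielding multi-modal features
    focused on key vision information."""
    attn = softmax(vis @ lang.T / np.sqrt(vis.shape[-1]))  # (P, T)
    return vis + attn @ lang                               # (P, D)

def language_guided_calibration(fused, lang):
    """Stage 2 (assumed form): a sentence-level embedding gates each
    channel of the fused features, calibrating them toward the context
    of the whole input sentence."""
    sent = lang.mean(axis=0)               # (D,) pooled sentence embedding
    gate = 1.0 / (1.0 + np.exp(-sent))     # sigmoid gate per channel
    return fused * gate                    # (P, D)

rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 8))   # 16 "pixel" features, dim 8
lang = rng.standard_normal((5, 8))   # 5 word-token features, dim 8

fused = vision_guided_fusion(vis, lang)
calibrated = language_guided_calibration(fused, lang)
print(calibrated.shape)
```

The point of the sketch is the ordering: vision guides the initial fusion, then language calibrates the result before decoding, so both modalities take a guiding role rather than one driving the entire fusion.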
Pages: 313-324
Page count: 12