Prompt-guided bidirectional deep fusion network for referring image segmentation

Cited by: 1
Authors
Wu, Junxian [1 ,2 ]
Zhang, Yujia [1 ]
Kampffmeyer, Michael [3 ]
Zhao, Xiaoguang [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] UiT Arctic Univ Norway, Dept Phys & Technol, Tromso, Norway
Funding
National Natural Science Foundation of China
Keywords
Referring image segmentation; Prompt-guided bidirectional encoder fusion; Prompt-guided cross-modal interaction;
DOI
10.1016/j.neucom.2024.128899
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Referring image segmentation involves accurately segmenting objects based on natural language descriptions. This poses challenges due to the intricate and varied nature of language expressions, as well as the requirement to identify relevant image regions among multiple objects. Current models predominantly employ language-aware early fusion techniques, which may misinterpret language expressions because the language encoder lacks explicit visual guidance. Additionally, early fusion methods cannot adequately leverage high-level context. To address these limitations, this paper introduces the Prompt-guided Bidirectional Deep Fusion Network (PBDF-Net) to enhance the fusion of the language and vision modalities. In contrast to traditional unidirectional early fusion approaches, our approach employs a prompt-guided bidirectional encoder fusion (PBEF) module to promote mutual cross-modal fusion across multiple stages of the vision and language encoders. Furthermore, PBDF-Net incorporates a prompt-guided cross-modal interaction (PCI) module during the late fusion stage, facilitating a more profound integration of contextual information from both modalities and resulting in more accurate target segmentation. Comprehensive experiments on the RefCOCO, RefCOCO+, G-Ref and ReferIt datasets substantiate the efficacy of the proposed method, demonstrating significant performance gains over existing approaches.
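The abstract describes bidirectional fusion at each encoder stage, i.e. vision features attend to language features and vice versa, mediated by learned prompt tokens. The record does not give the actual PBEF/PCI formulations, so the following is only a minimal NumPy sketch of the general idea under stated assumptions: a single fusion stage where hypothetical prompt tokens are prepended to each modality's context before two residual cross-attention passes. Function names, token counts, and the prompt-concatenation scheme are illustrative, not the paper's design.

```python
import numpy as np

def cross_attention(q_feats, kv_feats, d):
    # Scaled dot-product cross-attention: queries come from one modality,
    # keys/values from the other (projection matrices omitted for brevity).
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv_feats

def bidirectional_fusion_stage(vis, lang, prompt):
    # Hypothetical PBEF-style stage: learned prompt tokens are concatenated
    # to each modality's tokens, then each stream is updated with the other
    # stream's context via a residual cross-attention pass.
    d = vis.shape[-1]
    lang_ctx = np.concatenate([prompt, lang], axis=0)
    vis_ctx = np.concatenate([prompt, vis], axis=0)
    vis_out = vis + cross_attention(vis, lang_ctx, d)    # vision <- language
    lang_out = lang + cross_attention(lang, vis_ctx, d)  # language <- vision
    return vis_out, lang_out

rng = np.random.default_rng(0)
vis = rng.standard_normal((196, 64))   # e.g. 14x14 visual tokens
lang = rng.standard_normal((12, 64))   # e.g. 12 word tokens
prompt = rng.standard_normal((4, 64))  # e.g. 4 learned prompt tokens
v, l = bidirectional_fusion_stage(vis, lang, prompt)
print(v.shape, l.shape)  # (196, 64) (12, 64)
```

Because each stream keeps its own token count while absorbing the other's context, such a stage can be stacked across encoder levels, which is what distinguishes the bidirectional scheme from one-way language-aware early fusion.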
Pages: 12
Related papers
67 references
[1] Bahng, Hyojin, 2022, arXiv preprint arXiv:2203.17274.
[2] Chen, Jianbo; Shen, Yelong; Gao, Jianfeng; Liu, Jingjing; Liu, Xiaodong. Language-Based Image Editing with Recurrent Attentive Models [J]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 8721-8729.
[3] Chng, Y.X., 2024, PROC IEEECVF C COMPU, P26573.
[4] Cho, Yubin; Yu, Hyunwoo; Kang, Suk-Ju. Cross-Aware Early Fusion With Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation [J]. IEEE Transactions on Multimedia, 2024, 26: 5823-5833.
[5] Deng, J., 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848.
[6] Devlin, J., 2019, arXiv, DOI 10.48550/ARXIV.1810.04805.
[7] Ding, Haixin; Zhang, Shengchuan; Wu, Qiong; Yu, Songlin; Hu, Jie; Cao, Liujuan; Ji, Rongrong. Bilateral Knowledge Interaction Network for Referring Image Segmentation [J]. IEEE Transactions on Multimedia, 2024, 26: 2966-2977.
[8] Ding, Henghui; Liu, Chang; Wang, Suchen; Jiang, Xudong. VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 7900-7916.
[9] Fan, Zhenkun; Hu, Guosheng; Sun, Xin; Wang, Gaige; Dong, Junyu; Su, Chi. Self-attention neural architecture search for semantic image segmentation [J]. Knowledge-Based Systems, 2022, 239.
[10] Feng, Guang; Hu, Zhiwei; Zhang, Lihe; Lu, Huchuan. Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [J]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021: 15501-15510.