Prompt-guided bidirectional deep fusion network for referring image segmentation

Cited by: 1
Authors
Wu, Junxian [1 ,2 ]
Zhang, Yujia [1 ]
Kampffmeyer, Michael [3 ]
Zhao, Xiaoguang [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] UiT Arctic Univ Norway, Dept Phys & Technol, Tromso, Norway
Funding
National Natural Science Foundation of China
Keywords
Referring image segmentation; Prompt-guided bidirectional encoder fusion; Prompt-guided cross-modal interaction;
DOI
10.1016/j.neucom.2024.128899
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Referring image segmentation involves accurately segmenting objects based on natural language descriptions. This poses challenges due to the intricate and varied nature of language expressions, as well as the requirement to identify relevant image regions among multiple objects. Current models predominantly employ language-aware early fusion techniques, which may misinterpret language expressions because the language encoder lacks explicit visual guidance. Additionally, early fusion methods cannot adequately leverage high-level context. To address these limitations, this paper introduces the Prompt-guided Bidirectional Deep Fusion Network (PBDF-Net) to enhance the fusion of the language and vision modalities. In contrast to traditional unidirectional early fusion approaches, our approach employs a prompt-guided bidirectional encoder fusion (PBEF) module to promote mutual cross-modal fusion across multiple stages of the vision and language encoders. Furthermore, PBDF-Net incorporates a prompt-guided cross-modal interaction (PCI) module during the late fusion stage, facilitating a more profound integration of contextual information from both modalities and resulting in more accurate target segmentation. Comprehensive experiments on the RefCOCO, RefCOCO+, G-Ref and ReferIt datasets substantiate the efficacy of the proposed method, demonstrating significant performance gains over existing approaches.
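The abstract describes bidirectional fusion at each encoder stage, i.e. vision features attend to language features and vice versa, mediated by learned prompt tokens. The record does not give the actual PBEF/PCI formulations, so the following is only a minimal NumPy sketch of the general idea under stated assumptions: a single fusion stage where hypothetical prompt tokens are prepended to each modality's context before two residual cross-attention passes. Function names, token counts, and the prompt-concatenation scheme are illustrative, not the paper's design.

```python
import numpy as np

def cross_attention(q_feats, kv_feats, d):
    # Scaled dot-product cross-attention: queries come from one modality,
    # keys/values from the other (projection matrices omitted for brevity).
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv_feats

def bidirectional_fusion_stage(vis, lang, prompt):
    # Hypothetical PBEF-style stage: learned prompt tokens are concatenated
    # to each modality's tokens, then each stream is updated with the other
    # stream's context via a residual cross-attention pass.
    d = vis.shape[-1]
    lang_ctx = np.concatenate([prompt, lang], axis=0)
    vis_ctx = np.concatenate([prompt, vis], axis=0)
    vis_out = vis + cross_attention(vis, lang_ctx, d)    # vision <- language
    lang_out = lang + cross_attention(lang, vis_ctx, d)  # language <- vision
    return vis_out, lang_out

rng = np.random.default_rng(0)
vis = rng.standard_normal((196, 64))   # e.g. 14x14 visual tokens
lang = rng.standard_normal((12, 64))   # e.g. 12 word tokens
prompt = rng.standard_normal((4, 64))  # e.g. 4 learned prompt tokens
v, l = bidirectional_fusion_stage(vis, lang, prompt)
print(v.shape, l.shape)  # (196, 64) (12, 64)
```

Because each stream keeps its own token count while absorbing the other's context, such a stage can be stacked across encoder levels, which is what distinguishes the bidirectional scheme from one-way language-aware early fusion.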
Pages: 12
Related papers
67 references
[1] Bahng, Hyojin, 2022, arXiv preprint arXiv:2203.17274.
[2] Chen, Jianbo; Shen, Yelong; Gao, Jianfeng; Liu, Jingjing; Liu, Xiaodong. Language-Based Image Editing with Recurrent Attentive Models [J]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 8721-8729.
[3] Chng, Y.X., 2024, PROC IEEECVF C COMPU, P26573.
[4] Cho, Yubin; Yu, Hyunwoo; Kang, Suk-Ju. Cross-Aware Early Fusion With Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation [J]. IEEE Transactions on Multimedia, 2024, 26: 5823-5833.
[5] Deng, J., 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848.
[6] Devlin, J., 2019, arXiv, DOI 10.48550/ARXIV.1810.04805.
[7] Ding, Haixin; Zhang, Shengchuan; Wu, Qiong; Yu, Songlin; Hu, Jie; Cao, Liujuan; Ji, Rongrong. Bilateral Knowledge Interaction Network for Referring Image Segmentation [J]. IEEE Transactions on Multimedia, 2024, 26: 2966-2977.
[8] Ding, Henghui; Liu, Chang; Wang, Suchen; Jiang, Xudong. VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 7900-7916.
[9] Fan, Zhenkun; Hu, Guosheng; Sun, Xin; Wang, Gaige; Dong, Junyu; Su, Chi. Self-attention neural architecture search for semantic image segmentation [J]. Knowledge-Based Systems, 2022, 239.
[10] Feng, Guang; Hu, Zhiwei; Zhang, Lihe; Lu, Huchuan. Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [J]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021: 15501-15510.