REFERRING IMAGE SEGMENTATION WITH TWO-STAGE MULTI-MODAL INTERACTION

Cited by: 0
Authors
Wang, Zhenhua [1]
Ye, Linwei [1]
Affiliations
[1] Wenzhou University, College of Computer Science and Artificial Intelligence, Wenzhou, People's Republic of China
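Source
2024 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2024
Funding
National Natural Science Foundation of China
Keywords
Vision and Language; Referring Image Segmentation
DOI
10.1109/ICIP51287.2024.10647356
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract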
Referring image segmentation aims to segment the entities in an image that a given natural language sentence refers to. The central idea is to let textual and visual features interact so that multi-modal relationships can be built. Prior state-of-the-art methods focus mainly on either local interaction between multi-level intermediate features or global text-to-image alignment. The former can fall short in capturing global multi-modal information exchange, and the latter in capturing fine-grained details of the referred object. To overcome this, we introduce a referring image segmentation framework with two-stage multi-modal interaction. First, a multi-level cross-modal fusion module lets the intermediate features of the linguistic and visual modalities interact, recovering fine-grained details of referred objects. Second, a global alignment module further aligns linguistic and visual information so that the entire referred object is localized accurately. Experiments on three referring image segmentation datasets show that the proposed two-stage multi-modal interaction framework clearly outperforms contemporary state-of-the-art approaches.
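The two stages named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the modules' internals, so everything below is an assumption: stage 1 is stood in for by plain scaled dot-product cross-attention from pixel features to word features, and stage 2 by scoring each fused pixel against a mean-pooled sentence embedding. The function names (`cross_modal_fusion`, `global_alignment`, `segment`) are hypothetical and this is not the authors' architecture.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_fusion(visual, words):
    """Stage 1 (assumed form): each pixel feature attends over the word
    features and adds the attended language context back as a residual."""
    scale = math.sqrt(len(words[0]))
    fused = []
    for pixel in visual:
        weights = softmax([dot(pixel, w) / scale for w in words])
        context = [sum(wt * w[d] for wt, w in zip(weights, words))
                   for d in range(len(pixel))]
        fused.append([p + c for p, c in zip(pixel, context)])
    return fused


def global_alignment(fused, words):
    """Stage 2 (assumed form): score every fused pixel against a mean-pooled
    sentence embedding and map the score to a probability with a sigmoid."""
    n = len(words)
    sentence = [sum(w[d] for w in words) / n for d in range(len(words[0]))]
    scale = math.sqrt(len(sentence))
    return [1.0 / (1.0 + math.exp(-dot(p, sentence) / scale)) for p in fused]


def segment(visual, words, threshold=0.5):
    """Run both stages and threshold the per-pixel probabilities."""
    probs = global_alignment(cross_modal_fusion(visual, words), words)
    return [1 if p > threshold else 0 for p in probs]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy inputs: 6 flattened pixel features and 4 word features, dimension 8.
    visual = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]
    words = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    print(segment(visual, words))
```

In this toy form the first stage injects word-level context into every pixel, while the second decides, pixel by pixel, whether the fused feature agrees with the sentence as a whole. A real model would use learned projections and multi-scale feature maps rather than raw dot products.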
Pages: 2543-2549
Page count: 7