Dual Context Perception Transformer for Referring Image Segmentation

Cited by: 0
Authors
Kong, Yuqiu [1]
Liu, Junhua [1]
Yao, Cuili [1]
Affiliation
[1] Dalian Univ Technol, Dalian 116024, Peoples R China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025 / Vol. 15035
Funding
National Natural Science Foundation of China;
Keywords
Referring image segmentation; Vision-linguistic alignment; Multi-modal fusion;
DOI
10.1007/978-981-97-8620-6_15
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring image segmentation segments target objects in an image according to language expressions. Existing methods mainly integrate multi-modal features with attention mechanisms. However, most of them tend to favor the features of a single modality during the fusion stage and fall short in exploring cross-modal contextual information, which is critical for accurately localizing target regions. To this end, we propose a novel architecture named Dual Context Perception Transformer (DCPformer), which considers both visual and linguistic contextual information during the fusion and reasoning stages. Specifically, a Cross-modal Context-aware Perception Module (CCPM) is designed to model cross-modal alignment in a unified visual-linguistic representation space. Furthermore, we propose an Information Feedback Module (IFM) that generates a rectification mask from deep-scale features and uses it to filter out signals unrelated to the target object in shallower-scale features. Extensive experiments show that the proposed DCPformer achieves state-of-the-art performance compared with existing methods on three challenging benchmarks.
Pages: 216-230
Number of pages: 15