Dual Context Perception Transformer for Referring Image Segmentation

Cited by: 0
Authors
Kong, Yuqiu [1]
Liu, Junhua [1]
Yao, Cuili [1]
Affiliations
[1] Dalian Univ Technol, Dalian 116024, Peoples R China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025 / Vol. 15035
Funding
National Natural Science Foundation of China;
Keywords
Referring image segmentation; Vision-linguistic alignment; Multi-modal fusion;
DOI
10.1007/978-981-97-8620-6_15
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring image segmentation segments target objects in an image according to language expressions. Existing methods mainly integrate multi-modal features with attention mechanisms. However, most of them tend to favor the features of a single modality during the fusion stage and fall short in exploring cross-modal contextual information, which is critical for accurately localizing target regions. To this end, we propose a novel architecture named Dual Context Perception Transformer (DCPformer), which considers both visual and linguistic contextual information during the fusion and reasoning stages. Specifically, a Cross-modal Context-aware Perception Module (CCPM) is designed to model cross-modal alignment in a unified visual-linguistic representation space. Furthermore, we propose an Information Feedback Module (IFM) that generates a rectification mask from deep-scale features and uses it to filter out signals unrelated to the target object in shallower-scale features. Extensive experiments show that the proposed DCPformer achieves state-of-the-art performance compared with existing methods on three challenging benchmarks.
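The feedback idea in the abstract (a mask derived from deep-scale features gating shallower-scale features) can be illustrated with a toy sketch. This is not the authors' IFM: the abstract does not give its layers or equations, so the function names (`upsample_nearest`, `rectify`), the nearest-neighbor upsampling, and the sigmoid gate are all assumptions chosen only to show the general shape of mask-based filtering.

```python
import math


def sigmoid(x):
    """Squash a logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(grid, factor):
    """Repeat every cell `factor` times along both axes (nearest-neighbor)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def rectify(shallow, deep_logits, factor):
    """Gate a shallow feature map with a soft mask built from deep logits.

    Hypothetical stand-in for a rectification step: the deep map is
    upsampled to the shallow resolution, turned into a soft mask with a
    sigmoid, and multiplied element-wise into the shallow features.
    """
    upsampled = upsample_nearest(deep_logits, factor)
    mask = [[sigmoid(v) for v in row] for row in upsampled]
    return [
        [s * m for s, m in zip(s_row, m_row)]
        for s_row, m_row in zip(shallow, mask)
    ]


# 1x2 deep-scale map: left half looks like the target, right half does not.
deep_logits = [[8.0, -8.0]]
# 2x4 shallow-scale map (single channel, made-up values).
shallow = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0]]

filtered = rectify(shallow, deep_logits, factor=2)
```

In this toy setup the left half of `filtered` stays close to the original shallow values while the right half is pushed toward zero, which is the qualitative behavior the abstract attributes to the rectification mask.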
Pages: 216-230
Page count: 15