Dual Context Perception Transformer for Referring Image Segmentation

Cited by: 0

Authors:

Kong, Yuqiu [1]

Liu, Junhua [1]

Yao, Cuili [1]

Affiliations:

[1] Dalian Univ Technol, Dalian 116024, Peoples R China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025 / Vol. 15035

Funding:

National Natural Science Foundation of China;

Keywords:

Referring image segmentation; Vision-linguistic alignment; Multi-modal fusion;

DOI:

10.1007/978-981-97-8620-6_15

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Referring image segmentation segments target objects in an image according to language expressions. Existing methods mainly integrate multi-modal features with attention mechanisms. However, most of them favor the features of a single modality during the fusion stage and fall short in exploring cross-modal contextual information, which is critical for accurately localizing target regions. To this end, we propose a novel architecture named Dual Context Perception Transformer (DCPformer), which considers both visual and linguistic contextual information during the fusion and reasoning stages. Specifically, a Cross-modal Context-aware Perception Module (CCPM) is designed to model cross-modal alignment in a unified visual-linguistic representation space. Furthermore, we propose an Information Feedback Module (IFM) that generates a rectification mask from deep-scale features and uses it to filter out signals unrelated to the target object in shallower-scale features. Extensive experiments show that the proposed DCPformer achieves state-of-the-art performance compared with existing methods on three challenging benchmarks.

Pages: 216-230

Page count: 15

50 items in total

[41] Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [J].

Jiao, Yang; Jie, Zequn; Luo, Weixin; Chen, Jingjing; Jiang, Yu-Gang; Wei, Xiaolin; Ma, Lin.

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, :1331-1340

[42] Mask prior generation with language queries guided networks for referring image segmentation [J].

Zhou, Jinhao; Xiao, Guoqiang; Lew, Michael S.; Wu, Song.

COMPUTER VISION AND IMAGE UNDERSTANDING, 2025, 253

[43] CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation [J].

Xu, Mingzhu; Xiao, Tianxiang; Liu, Yutong; Tang, Haoyu; Hu, Yupeng; Nie, Liqiang.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (04) :3234-3249

[44] SATR: Semantics-Aware Triadic Refinement network for referring image segmentation [J].

Xie, Jialong; Liu, Jin; Wang, Guoxiang; Zhou, Fengyu.

KNOWLEDGE-BASED SYSTEMS, 2024, 284

[45] Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation [J].

Liu, Chang; Ding, Henghui; Zhang, Yulun; Jiang, Xudong.

IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 :3054-3065

[46] REFERRING IMAGE SEGMENTATION WITH TWO-STAGE MULTI-MODAL INTERACTION [J].

Wang, Zhenhua; Ye, Linwei.

2024 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2024, :2543-2549

[47] AIUnet: Asymptotic inference with U2-Net for referring image segmentation [J].

Li, Jiangquan; Shan, Shimin; Liu, Yu; Xu, Kaiping; Hu, Xiwen; Xue, Mingcheng.

PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023, :24-32

[48] Cross-modal attention guided visual reasoning for referring image segmentation [J].

Wenjing Zhang; Mengnan Hu; Quange Tan; Qianli Zhou; Rong Wang.

Multimedia Tools and Applications, 2023, 82 :28853-28872

[49] Prompt-guided bidirectional deep fusion network for referring image segmentation [J].

Wu, Junxian; Zhang, Yujia; Kampffmeyer, Michael; Zhao, Xiaoguang.

NEUROCOMPUTING, 2025, 616

[50] Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [J].

Lei, Sen; Xiao, Xinyu; Zhang, Tianlin; Li, Heng-Chao; Shi, Zhenwei; Zhu, Qing.

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
