Dual Context Perception Transformer for Referring Image Segmentation

Times cited: 0
Authors
Kong, Yuqiu [1]
Liu, Junhua [1]
Yao, Cuili [1]
Affiliations
[1] Dalian Univ Technol, Dalian 116024, Peoples R China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025 / Vol. 15035
Funding
National Natural Science Foundation of China;
Keywords
Referring image segmentation; Vision-linguistic alignment; Multi-modal fusion;
DOI
10.1007/978-981-97-8620-6_15
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring image segmentation segments target objects in an image according to language expressions. Existing methods mainly focus on integrating multi-modal features with attention mechanisms. However, most of them tend to favor the features of a single modality during the fusion stage and fall short in exploring cross-modal contextual information, which is critical for accurately localizing target regions. To this end, we propose a novel architecture named Dual Context Perception Transformer (DCPformer), which considers both visual and linguistic contextual information during the fusion and reasoning stages. Specifically, a Cross-modal Context-aware Perception Module (CCPM) is designed to model cross-modal alignment in a unified visual-linguistic representation space. Furthermore, we propose an Information Feedback Module (IFM) that generates a rectification mask from deep-scale features and uses it to filter out signals unrelated to the target object in shallower-scale features. Extensive experiments show that the proposed DCPformer achieves state-of-the-art performance compared with existing methods on three challenging benchmarks.
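The abstract gives no implementation details for the IFM, so the following is only a rough illustration of the general idea it describes (a mask derived from deep-scale features gating shallower-scale features), not the authors' method. The function names, the single-channel linear projection, the sigmoid gate, and nearest-neighbour upsampling are all assumptions made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rectification_mask(deep_feat, weights):
    """Project deep features (C, h, w) to a single-channel mask in (0, 1).

    `weights` (shape (C,)) is a hypothetical 1x1 projection; the abstract
    does not say how the mask is actually computed.
    """
    logits = np.tensordot(weights, deep_feat, axes=([0], [0]))  # (h, w)
    return sigmoid(logits)

def upsample_nearest(mask, factor):
    """Nearest-neighbour upsampling of a 2D mask by an integer factor."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)

def filter_shallow(shallow_feat, deep_feat, weights):
    """Gate shallow features (C', H, W) with a mask built from deep features.

    Assumes H and W are the same integer multiple of h and w.
    """
    factor = shallow_feat.shape[1] // deep_feat.shape[1]
    mask = upsample_nearest(rectification_mask(deep_feat, weights), factor)
    return shallow_feat * mask[None, :, :]  # broadcast mask over channels

rng = np.random.default_rng(0)
deep = rng.normal(size=(4, 2, 2))      # deep-scale features: coarse, 4 channels
shallow = rng.normal(size=(3, 4, 4))   # shallower-scale features: finer, 3 channels
w = rng.normal(size=4)                 # hypothetical projection weights
filtered = filter_shallow(shallow, deep, w)
print(filtered.shape)
```

Because the mask lies strictly between 0 and 1, each shallow activation is attenuated rather than amplified; locations the deep features score as unrelated to the target are pushed toward zero.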
Pages: 216-230
Number of pages: 15