Dual Context Perception Transformer for Referring Image Segmentation

被引：0

作者：

Kong, Yuqiu ^{[1
]}

Liu, Junhua ^{[1
]}

Yao, Cuili ^{[1
]}

机构：

[1] Dalian Univ Technol, Dalian 116024, Peoples R China

来源：

PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025年 / 15035卷

基金：

中国国家自然科学基金;

关键词：

Referring image segmentation; Vision-linguistic alignment; Multi-modal fusion;

D O I：

10.1007/978-981-97-8620-6_15

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Referring image segmentation segments target objects in the image according to language expressions. Existing methods mainly make efforts to integrate multi-modal features with attention mechanisms. However, most methods tend to incline to the feature of a single modal during the fusion stage and fall short in exploring cross-modal contextual information, which is critical in localizing accurate target regions. To this end, we propose a novel architecture named Dual Context Perception Transformer (DCPformer) which considers both visual and linguistic contextual information during the fusion and reasoning stages. Specifically, a Cross-modal Context-aware Perception Module (CCPM) is designed to model cross-modal alignment in a unified visual-linguistic representation space. Furthermore, we propose an Information Feedback Module (IFM) that generates a rectification mask based on deep-scale features and filters unrelated signals of the target object in features of shallower scales. Extensive experiments show that the proposed DCP-former achieves state-of-the-art performances against existing methods on three challenging benchmarks.

引用

页码：216 / 230

页数：15