CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation

Times Cited: 1
Authors
Xu, Mingzhu [1]
Xiao, Tianxiang [1]
Liu, Yutong [1]
Tang, Haoyu [1]
Hu, Yupeng [1]
Nie, Liqiang [2]
Affiliations
[1] Shandong Univ, Sch Software, Jinan 250101, Shandong, Peoples R China
[2] Harbin Inst Technol Shenzhen, Sch Comp Sci & Technol, Shenzhen 518055, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Feature extraction; Semantics; Linguistics; Cognition; Decoding; Circuits and systems; Encoding; Semantic segmentation; Computer architecture; Referring image segmentation; vision and language; cross modal reasoning; graph neural network;
DOI
10.1109/TCSVT.2024.3508752
CLC Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Discipline Code
0808; 0809
Abstract
Referring Image Segmentation (RIS) aims to semantically segment the target object (the referent) specified by a natural language query. Existing works still suffer from mistakenly segmenting non-referent objects, which can be attributed to insufficient comprehension of vision and language. To tackle this problem, we propose a Cross-Modal Interactive Reasoning Network (CMIRNet) to explore the semantic information shared consistently between vision and language. Specifically, we first devise a novel Text-Guided Multi-Modality Joint Encoder (TGMM-JE), in which the key expression is extracted and the important visual features are encoded under the continuous guidance of the language expression. Then, we design a Cross-Graph Interactive Positioning (CGIP) module to locate the key pixels of the referent object in the deepest layer: multi-modality graphs are constructed over the visual and linguistic features, and the important pixels are positioned through cross-graph interaction and intra-graph reasoning. Finally, a novel Cross-Modal Attention Enhanced DEcoder (CMAE-DE) progressively refines the referent mask from coarse to fine, where hybrid cross-modal attentions are explored to enhance the representation of the referent object. Extensive ablation studies validate the efficacy of our key modules, and comprehensive experimental results show the superiority of the proposed model over 22 state-of-the-art (SOTA) models.
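The paper's implementation is not reproduced in this record; the following minimal PyTorch sketch only illustrates the kind of cross-graph interaction followed by intra-graph reasoning that the CGIP module describes. All names, dimensions, and update rules below (scaled dot-product affinity across graphs, a single GCN-style propagation within the visual graph) are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGraphInteraction(nn.Module):
    """Hypothetical sketch: visual nodes attend to linguistic nodes
    (cross-graph interaction), then propagate over the visual graph
    (intra-graph reasoning)."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)      # visual nodes -> queries
        self.k = nn.Linear(dim, dim)      # word nodes   -> keys
        self.v = nn.Linear(dim, dim)      # word nodes   -> values
        self.intra = nn.Linear(dim, dim)  # intra-graph update weights

    def forward(self, vis_nodes, lang_nodes, adj):
        # vis_nodes:  (B, Nv, C) pixel nodes from the deepest visual layer
        # lang_nodes: (B, Nl, C) word nodes from the language encoder
        # adj:        (B, Nv, Nv) row-normalized visual-graph adjacency
        # Cross-graph interaction: each visual node gathers word-level cues.
        attn = torch.softmax(
            self.q(vis_nodes) @ self.k(lang_nodes).transpose(1, 2)
            / vis_nodes.size(-1) ** 0.5, dim=-1)           # (B, Nv, Nl)
        vis_nodes = vis_nodes + attn @ self.v(lang_nodes)  # inject language
        # Intra-graph reasoning: one GCN-style propagation with a residual.
        return F.relu(self.intra(adj @ vis_nodes)) + vis_nodes

# Toy usage with random tensors.
B, Nv, Nl, C = 2, 196, 12, 256
adj = torch.softmax(torch.randn(B, Nv, Nv), dim=-1)  # stand-in adjacency
out = CrossGraphInteraction(C)(torch.randn(B, Nv, C), torch.randn(B, Nl, C), adj)
print(out.shape)  # torch.Size([2, 196, 256])

The residual connections keep the original visual evidence intact while language cues and neighborhood context are mixed in, which matches the abstract's framing of positioning referent pixels through interaction rather than replacing visual features outright.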
Pages: 3234-3249
Number of Pages: 16