Cross-modal transformer with language query for referring image segmentation

Cited by: 9
Authors
Zhang, Wenjing [1]
Tan, Quange [1]
Li, Pengxin [1]
Zhang, Qi [1]
Wang, Rong [1, 2]
Affiliations
[1] Peoples Publ Secur Univ China, Sch Informat & Cyber Secur, Beijing 434020, Peoples R China
[2] Minist Publ Secur, Key Lab Secur Prevent Technol & Risk Assessment, Beijing 434020, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Referring image segmentation; Deep interaction; Cross-modal transformer; Semantics-guided detail enhancement
DOI
10.1016/j.neucom.2023.03.011
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Referring image segmentation (RIS) aims to predict a segmentation mask for a target specified by a natural language expression. However, existing methods fail to achieve the deep interaction between vision and language that RIS requires, resulting in inaccurate segmentation. To address this problem, a cross-modal transformer (CMT) with language queries for referring image segmentation is proposed. First, a cross-modal encoder of CMT is designed for intra-modal and inter-modal interaction, capturing context-aware visual features. Second, to generate compact visual-aware language queries, a language-query encoder (LQ) embeds key visual cues into linguistic features. In particular, the combination of the cross-modal encoder and the language-query encoder realizes the mutual guidance of vision and language. Finally, the cross-modal decoder of CMT is constructed to learn multimodal features of the referent from the context-aware visual features and visual-aware language queries. In addition, a semantics-guided detail enhancement (SDE) module is constructed to fuse the semantic-rich multimodal features with detail-rich low-level visual features, which supplements the spatial details of the predicted segmentation masks. Extensive experiments on four referring image segmentation datasets demonstrate the effectiveness of the proposed method. (c) 2023 Elsevier B.V. All rights reserved.
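The mutual guidance between vision and language described in the abstract can be illustrated with plain scaled dot-product cross-attention. The following is only a minimal sketch under stated assumptions, not the paper's CMT: the function names (`softmax`, `dot`, `cross_attention`), the toy feature values, and the absence of learned projections, multi-head structure, and the SDE module are all simplifications introduced here for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted sum of values.

    Hypothetical helper: no learned weight matrices, single head only.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs

# Toy inputs (assumed): 4 flattened pixel features and 2 word features, dim 2.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
language = [[1.0, 0.0], [0.0, 2.0]]

# Language side attends to pixels: a stand-in for "visual-aware language queries".
queries = cross_attention(language, visual, visual)

# Pixel side attends to those queries: a stand-in for the decoder step that
# builds multimodal features of the referent.
fused = cross_attention(visual, queries, queries)
```

In this toy setup each word feature pulls in the pixel features it aligns with, and each pixel then reads back from the resulting queries. A real model would add learned projections, positional encodings, and a mask prediction head on top of `fused`.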
Pages: 191-205
Number of pages: 15
Related papers
50 in total
  • [1] Decoupled Cross-Modal Transformer for Referring Video Object Segmentation
    Wu, Ao
    Wang, Rong
    Tan, Quange
    Song, Zhenfeng
    SENSORS, 2024, 24 (16)
  • [2] Cross-Modal Recurrent Semantic Comprehension for Referring Image Segmentation
    Shang, Chao
    Li, Hongliang
    Qiu, Heqian
    Wu, Qingbo
    Meng, Fanman
    Zhao, Taijin
    Ngan, King Ngi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (07) : 3229 - 3242
  • [3] Area-keywords cross-modal alignment for referring image segmentation
    Zhang, Huiyong
    Wang, Lichun
    Li, Shuang
    Xu, Kai
    Yin, Baocai
    NEUROCOMPUTING, 2024, 581
  • [4] Cross-Modal Self-Attention Network for Referring Image Segmentation
    Ye, Linwei
    Rochan, Mrigank
    Liu, Zhi
    Wang, Yang
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10494 - 10503
  • [5] Cross-modal attention guided visual reasoning for referring image segmentation
    Zhang, Wenjing
    Hu, Mengnan
    Tan, Quange
    Zhou, Qianli
    Wang, Rong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (19) : 28853 - 28872
  • [6] CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation
    Xu, Mingzhu
    Xiao, Tianxiang
    Liu, Yutong
    Tang, Haoyu
    Hu, Yupeng
    Nie, Liqiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (04) : 3234 - 3249
  • [8] Cross-Modal Progressive Comprehension for Referring Segmentation
    Liu, Si
    Hui, Tianrui
    Huang, Shaofei
    Wei, Yunchao
    Li, Bo
    Li, Guanbin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (09) : 4761 - 4775
  • [9] Mixed-scale cross-modal fusion network for referring image segmentation
    Pan, Xiong
    Xie, Xuemei
    Yang, Jianxiu
    NEUROCOMPUTING, 2025, 614
  • [10] Vision-Language Transformer and Query Generation for Referring Segmentation
    Ding, Henghui
    Liu, Chang
    Wang, Suchen
    Jiang, Xudong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 16301 - 16310