Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

Times Cited: 0
Authors
Lei, Sen [1 ]
Xiao, Xinyu [2 ]
Zhang, Tianlin [3 ]
Li, Heng-Chao [1 ]
Shi, Zhenwei [4 ]
Zhu, Qing [5 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Informat Sci & Technol, Chengdu 611756, Peoples R China
[2] Ant Group Co, Hangzhou 688688, Peoples R China
[3] AVIC, Luoyang Inst Electroopt Equipment, Luoyang 471000, Peoples R China
[4] Beihang Univ, Image Proc Ctr, Sch Astronaut, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[5] Southwest Jiaotong Univ, Fac Geosci & Engn, Chengdu 611756, Peoples R China
Source
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2025, Vol. 63
Funding
National Natural Science Foundation of China;
Keywords
Remote sensing; Image segmentation; Visualization; Feature extraction; Linguistics; Transformers; Electronic mail; Adaptation models; Object recognition; Grounding; Fine-grained image-text alignment; referring image segmentation; remote sensing images; CLASSIFICATION; NETWORK;
DOI
10.1109/TGRS.2024.3522293
CLC Classification
P3 [Geophysics]; P59 [Geochemistry];
Discipline Codes
0708; 070902;
Abstract
Given a language expression, referring remote sensing image segmentation (RRSIS) aims to identify ground objects and assign pixelwise labels within the imagery. One of the key challenges of this task is to capture discriminative multimodal features via image-text alignment. However, existing RRSIS methods rely on a single vanilla, coarse alignment, in which the language expression is directly encoded and fused with the visual features. In this article, we argue that fine-grained image-text alignment can improve the extraction of multimodal information. To this end, we propose a new RRSIS method that fully exploits the visual and linguistic representations. Specifically, the original referring expression is regarded as context text and is further decoupled into ground-object and spatial-position texts. The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts, yielding more discriminative multimodal representations. Meanwhile, to handle the various scales of ground objects in remote sensing imagery, we introduce a text-aware multiscale enhancement module (TMEM) that adaptively performs cross-scale fusion and interaction. We evaluate the proposed method on two public referring remote sensing image segmentation datasets, RefSegRS and RRSIS-D, on which it outperforms several state-of-the-art methods. The code will be publicly available at https://github.com/Shaosifan/FIANet.
Pages: 11
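
Illustrative sketch. The abstract above outlines two module-level ideas: decoupling the referring expression into ground-object and spatial-position texts, and aligning visual features with each text (FIAM). The PyTorch sketch below is a minimal illustration of that decoupling-plus-alignment idea only; the keyword heuristic in decouple_expression, the module name FineGrainedAlignment, and all dimensions are assumptions made for illustration, not the authors' FIANet implementation (see the linked repository for the released code).

# A minimal sketch, assuming a keyword-based decoupling heuristic and a
# cross-attention alignment; not the authors' released FIANet implementation.
import torch
import torch.nn as nn

SPATIAL_WORDS = {"left", "right", "top", "bottom", "upper", "lower",
                 "center", "middle", "near", "beside", "above", "below"}

def decouple_expression(expression: str):
    """Split a referring expression into ground-object words and
    spatial-position words with a crude keyword heuristic (assumed;
    the paper presumably uses a proper linguistic analysis)."""
    tokens = expression.lower().split()
    position = [t for t in tokens if t in SPATIAL_WORDS]
    obj = [t for t in tokens if t not in SPATIAL_WORDS]
    return " ".join(obj), " ".join(position)

class FineGrainedAlignment(nn.Module):
    """Hypothetical stand-in for FIAM: cross-attention between visual
    tokens and the object / position / context text embeddings, fused
    into one multimodal representation."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.obj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pos_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, vis, obj_txt, pos_txt, ctx_txt):
        # vis: (B, N, C) visual tokens; *_txt: (B, L, C) text embeddings.
        o, _ = self.obj_attn(vis, obj_txt, obj_txt)  # object-aware features
        p, _ = self.pos_attn(vis, pos_txt, pos_txt)  # position-aware features
        c, _ = self.ctx_attn(vis, ctx_txt, ctx_txt)  # context-aware features
        return self.fuse(torch.cat([o, p, c], dim=-1))

# Example usage (heuristic output shown in the comment):
obj_text, pos_text = decouple_expression("the red car on the left")
# obj_text == "the red car on the"; pos_text == "left"

B, N, L, C = 2, 196, 8, 256
fiam = FineGrainedAlignment(dim=C)
fused = fiam(torch.randn(B, N, C),   # visual tokens
             torch.randn(B, L, C),   # ground-object text embeddings
             torch.randn(B, L, C),   # spatial-position text embeddings
             torch.randn(B, L, C))   # full-context text embeddings
print(fused.shape)  # torch.Size([2, 196, 256])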