Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval

Cited by: 0
Authors
Tang, Xu [1 ]
Huang, Dabiao [1 ]
Ma, Jingjing [1 ]
Zhang, Xiangrong [1 ]
Liu, Fang [2 ]
Jiao, Licheng [1 ]
Affiliations
[1] Xidian Univ, Minist Educ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Minist Educ, Sch Comp Sci & Engn, Key Lab Intelligent Percept & Syst for High Dimens Informat, Nanjing 210094, Peoples R China
Source
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2024, Vol. 62
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Feature extraction; Transformers; Semantics; Training; Convolutional neural networks; Recurrent neural networks; Learning from prior experiences (LPEs); multiscale feature fusion; remote sensing image-text retrieval (RSITR); transformer; BIG DATA; FUSION;
DOI
10.1109/TGRS.2024.3464468
CLC Classification
P3 [Geophysics]; P59 [Geochemistry];
Discipline Codes
0708 ; 070902 ;
Abstract
Remote sensing (RS) image-text retrieval (RSITR) aims to retrieve relevant texts (RS images) based on the content of a given RS image (text). Existing methods typically employ convolutional neural networks (CNNs) and recurrent neural networks (RNNs) as encoders to learn visual and textual features for retrieval. Although feasible, this approach fails to give the global information hidden in the two modalities the attention it deserves. To mitigate this problem, transformers have been introduced. Nevertheless, the complexity of RS images presents challenges in directly applying transformer-based architectures to multimodal learning in RS scenes, particularly in visual feature extraction and cross-modal interaction. In addition, textual captions are always simpler than the complex RS images they describe, so the same semantic description can match multiple different images. This typical false-negative (FN) sample problem increases the difficulty of RSITR tasks. To address the above limitations, we propose a new RSITR model named prior-experience-based RS vision-language (PERSVL). First, specific visual and text encoders are used to extract features from RS images and texts, and a high-level feature complement (HFC) module based on the self-attention mechanism (SAM) is developed for the visual encoder to fully explore the complex contents of RS images. Second, a dual-branch multimodal fusion encoder (DBMFE) is designed to complete cross-modal learning. It comprises a dual-branch multimodal interaction (DBMI) module and a branch fusion module. DBMI fully explores the relationships between the modalities, enriching both visual and textual features; the branch fusion module integrates the cross-modal features and uses a classification head to generate matching scores for retrieval. Finally, a learning from prior experiences (LPEs) module is designed to reduce the influence of FN samples by analyzing the historical matching data produced during training. Experiments on three popular datasets show that PERSVL achieves superior performance compared with previous methods. By integrating the advantages of natural language and RS imagery, PERSVL can serve applications such as environmental monitoring, disaster evaluation, and urban planning. Our source code is available at: https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/PERSVL.
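To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of its two core ideas: a self-attention block that refines high-level visual tokens (the role the HFC module plays) and a dual-branch cross-attention fusion that produces matching scores (the role of the DBMI and branch fusion modules). All class, function, and parameter names here (HighLevelFeatureComplement, DualBranchFusion, the token shapes) are illustrative assumptions, not the authors' code; the official PERSVL implementation lives in the repository linked above.

```python
# Hypothetical sketch of the abstract's architecture, NOT the authors' code.
import torch
import torch.nn as nn


class HighLevelFeatureComplement(nn.Module):
    """Self-attention refinement of high-level visual tokens (HFC-style)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) features from the visual encoder.
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + attended)  # residual refinement


class DualBranchFusion(nn.Module):
    """Cross-attention in both directions, then a fused matching head."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.match_head = nn.Linear(2 * dim, 2)  # match / non-match logits

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # v: (B, Nv, dim) visual tokens; t: (B, Nt, dim) text tokens.
        v_enriched, _ = self.img2txt(v, t, t)  # visual branch queries text
        t_enriched, _ = self.txt2img(t, v, v)  # text branch queries vision
        fused = torch.cat([v_enriched.mean(1), t_enriched.mean(1)], dim=-1)
        return self.match_head(fused)          # matching scores for retrieval


if __name__ == "__main__":
    v = torch.randn(4, 49, 256)  # 4 images, 49 patch tokens each
    t = torch.randn(4, 20, 256)  # 4 captions, 20 word tokens each
    v = HighLevelFeatureComplement(256)(v)
    print(DualBranchFusion(256)(v, t).shape)  # torch.Size([4, 2])
```

In the same hedged spirit, the LPEs idea could be realized by tracking a running matching score for each negative pair across epochs and down-weighting pairs that consistently score high, since those are likely FN samples; the paper should be consulted for the actual mechanism.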
Pages: 13
Related Papers
50 records in total
  • [21] Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks
    Cheng, Qingrong
    Wen, Keyu
    Gu, Xiaodong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 7062 - 7075
  • [22] SWINT-RESNet: An Improved Remote Sensing Image Segmentation Model Based on Transformer
    Ma, Yuefeng
    Wang, Yingli
    Liu, Xingya
    Wang, Haiying
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21
  • [23] A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing
    Yuan, Zhiqiang
    Zhang, Wenkai
    Rong, Xuee
    Li, Xuan
    Chen, Jialiang
    Wang, Hongqi
    Fu, Kun
    Sun, Xian
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [24] Knowledge-Aware Text-Image Retrieval for Remote Sensing Images
    Mi, Li
    Dai, Xianjie
    Castillo-Navarro, Javiera
    Tuia, Devis
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [25] Enhancing Multi-Label Deep Hashing for Image and Audio With Joint Internal Global Loss Constraints and Large Vision-Language Model
    Liu, Ye
    Pan, Yan
    Yin, Jian
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 2550 - 2554
  • [26] Multiway-Adapter: Adapting Multimodal Large Language Models for Scalable Image-Text Retrieval
    Long, Zijun
    Killick, George
    McCreadie, Richard
    Camarasa, Gerardo Aragon
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 6580 - 6584
  • [27] Hash-Based Remote Sensing Image Retrieval
    Han, Lirong
    Paoletti, Mercedes E.
    Tao, Xuanwen
    Wu, Zhaoyue
    Haut, Juan M.
    Li, Peng
    Pastor-Vargas, R.
    Plaza, Antonio
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [28] Hierarchical Knowledge-Based Graph Embedding Model for Image-Text Matching in IoTs
    Zhang, Lizong
    Li, Meng
    Yan, Ke
    Wang, Ruozhou
    Hui, Bei
    IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (12) : 9399 - 9409
  • [29] RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks
    Wang, Peijin
    Hu, Huiyang
    Tong, Boyuan
    Zhang, Ziqi
    Yao, Fanglong
    Feng, Yingchao
    Zhu, Zining
    Chang, Hao
    Diao, Wenhui
    Ye, Qixiang
    Sun, Xian
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [30] Fusion-Based Correlation Learning Model for Cross-Modal Remote Sensing Image Retrieval
    Lv, Yafei
    Xiong, Wei
    Zhang, Xiaohan
    Cui, Yaqi
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19