Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval

Cited by: 0

Authors:

Tang, Xu [1]

Huang, Dabiao [1]

Ma, Jingjing [1]

Zhang, Xiangrong [1]

Liu, Fang [2]

Jiao, Licheng [1]

Affiliations:

[1] Xidian Univ, Minist Educ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Peoples R China

[2] Nanjing Univ Sci & Technol, Minist Educ, Sch Comp Sci & Engn, Key Lab Intelligent Percept & Systems for High Dime, Nanjing 210094, Peoples R China

Source:

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2024 / Vol. 62

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Feature extraction; Transformers; Semantics; Training; Convolutional neural networks; Recurrent neural networks; Learning from prior experiences (LPEs); multiscale feature fusion; remote sensing image-text retrieval (RSITR); transformer; BIG DATA; FUSION;

DOI:

10.1109/TGRS.2024.3464468

Chinese Library Classification (CLC) number:

P3 [Geophysics]; P59 [Geochemistry];

Discipline classification code:

0708 ; 070902 ;

Abstract:

Remote sensing (RS) image-text retrieval (RSITR) aims to retrieve relevant texts (RS images) based on the content of a given RS image (text). Existing methods typically employ a convolutional neural network (CNN) and a recurrent neural network (RNN) as encoders to learn visual and textual features for retrieval. Although feasible, this approach does not give the global information hidden in the different data the attention it deserves. To mitigate this problem, transformers have been introduced. Nevertheless, the complexity of RS images presents challenges in directly introducing transformer-based architectures to multimodal learning in RS scenes, particularly in visual feature extraction and cross-modal interaction. In addition, the textual captions are always simpler than the complex RS images, so the same semantic description can appear for different images. This typical false-negative (FN) sample problem increases the difficulty of RSITR tasks. To address the above limitations, we propose a new RSITR model named prior-experience-based RS vision-language (PERSVL). First, specific visual and text encoders are used to extract features from RS images and texts. A high-level feature complement (HFC) module is also developed based on the self-attention mechanism (SAM) for the visual encoder to fully explore the complex contents of RS images. Second, a dual-branch multimodal fusion encoder (DBMFE) is designed to complete the cross-modal learning. It comprises a dual-branch multimodal interaction (DBMI) module and a branch fusion module. DBMI is designed to fully explore the relationships between different modalities, enriching visual and textual features. The branch fusion module integrates the cross-modal features and utilizes a classification head to generate matching scores for retrieval. Finally, a learning from prior experiences (LPEs) module is designed to reduce the influence of FN samples by analyzing the historical data produced in the model training process. Experiments are conducted on three popular datasets, and the positive results show that our PERSVL model achieves superior performance compared with previous methods. By integrating the advantages of natural language and RS images, our PERSVL can be applied in various applications, such as environmental monitoring, disaster evaluation, and urban planning. Our source code is available at: https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/PERSVL.


Pages: 13

