Enhanced CLIP-GPT Framework for Cross-Lingual Remote Sensing Image Captioning

Cited by: 0

Authors:

Song, Rui [1]

Zhao, Beigeng [1]

Yu, Lizhi [2]

Affiliations:

[1] Criminal Invest Police Univ China, Coll Publ Secur Informat Technol & Intelligence, Shenyang 110035, Peoples R China

[2] Shenyang Publ Secur Bur, Yuhong Sub Bur, Shenyang 110141, Peoples R China

Source:

IEEE ACCESS | 2025 / Volume 13

Keywords:

Feature extraction; Remote sensing; Training; Visualization; Decoding; Transformers; Adaptation models; Sensors; Semantics; Natural languages; Remote sensing image captioning; CLIP; GPT; deep learning; multimodal;

DOI:

10.1109/ACCESS.2024.3522585

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Remote Sensing Image Captioning (RSIC) aims to generate precise and informative descriptive text for remote sensing images using computational algorithms. Traditional "encoder-decoder" approaches are limited by high training costs and a heavy reliance on large-scale annotated datasets, which hinders their practical application. To address these challenges, we propose a lightweight solution based on an enhanced CLIP-GPT framework. Our approach uses CLIP for zero-shot multimodal feature extraction from remote sensing images. A mapping network, designed and optimized around an improved Transformer with adaptive multi-head attention, then aligns these features with the text space of GPT-2 so that it can generate high-quality descriptive text. Experimental results on the Sydney-captions, UCM-captions, and RSICD datasets demonstrate that the proposed mapping network outperforms existing methods in leveraging CLIP-extracted multimodal features, leading the GPT language model to generate more accurate and stylistically appropriate text. Furthermore, our method achieves comparable or superior performance to traditional "encoder-decoder" baselines in terms of BLEU, CIDEr, and METEOR metrics, while requiring only one-fifth of the training time. Experiments on an additional Chinese-English bilingual RSIC dataset further show the advantages of our CLIP-GPT framework, which leverages extensive multimodal pre-training, and demonstrate its potential for cross-lingual RSIC tasks.
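The data flow described in the abstract (one image embedding from CLIP, a mapping network, then a sequence of prefix vectors in the language model's embedding space) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not specify the "improved Transformer" or the "adaptive multi-head attention", so plain single-head self-attention with identity projections stands in for it. The names `map_to_prefix`, `CLIP_DIM`, `PREFIX_LEN`, and `D_MODEL`, the random weights, and the toy dimensions are all assumptions made for illustration.

```python
import math
import random


def linear(x, w, b):
    """Apply y = W x + b to a single vector x (w is a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, used here as a
    stand-in for the paper's unspecified adaptive multi-head attention.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def map_to_prefix(clip_vec, w, b, prefix_len, d_model):
    """Project one image embedding to prefix_len vectors of size d_model."""
    flat = linear(clip_vec, w, b)
    tokens = [flat[i * d_model:(i + 1) * d_model] for i in range(prefix_len)]
    return self_attention(tokens)


# Toy sizes (hypothetical); real embedding widths are much larger.
CLIP_DIM, PREFIX_LEN, D_MODEL = 8, 4, 6

random.seed(0)
w = [[random.gauss(0.0, 0.1) for _ in range(CLIP_DIM)]
     for _ in range(PREFIX_LEN * D_MODEL)]
b = [0.0] * (PREFIX_LEN * D_MODEL)

# Stand-in for a CLIP image embedding; a real one would come from the CLIP encoder.
clip_vec = [random.gauss(0.0, 1.0) for _ in range(CLIP_DIM)]

prefix = map_to_prefix(clip_vec, w, b, PREFIX_LEN, D_MODEL)
print(len(prefix), len(prefix[0]))
```

In a pipeline of this kind, the resulting prefix vectors would be placed in front of the caption's token embeddings before being passed to the language model, so that only the mapping network needs to learn the alignment between the two spaces.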


Pages: 904-915

Number of pages: 12
