Enhanced CLIP-GPT Framework for Cross-Lingual Remote Sensing Image Captioning

Cited by: 0
Authors
Song, Rui [1]
Zhao, Beigeng [1]
Yu, Lizhi [2]
Affiliations
[1] Criminal Invest Police Univ China, Coll Publ Secur Informat Technol & Intelligence, Shenyang 110035, Peoples R China
[2] Shenyang Publ Secur Bur, Yuhong Sub Bur, Shenyang 110141, Peoples R China
Source
IEEE ACCESS | 2025 / Vol. 13
Keywords
Feature extraction; Remote sensing; Training; Visualization; Decoding; Transformers; Adaptation models; Sensors; Semantics; Natural languages; Remote sensing image captioning; CLIP; GPT; deep learning; multimodal;
DOI
10.1109/ACCESS.2024.3522585
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Remote Sensing Image Captioning (RSIC) aims to generate precise and informative descriptive text for remote sensing images using computational algorithms. Traditional "encoder-decoder" approaches are limited by high training costs and heavy reliance on large-scale annotated datasets, which hinders their practical application. To address these challenges, we propose a lightweight solution based on an enhanced CLIP-GPT framework. Our approach uses CLIP for zero-shot multimodal feature extraction from remote sensing images, then applies a mapping network, built on an improved Transformer with adaptive multi-head attention, to align these features with the text space of GPT-2 and so support the generation of high-quality descriptive text. Experimental results on the Sydney-captions, UCM-captions, and RSICD datasets show that the proposed mapping network outperforms existing methods in leveraging CLIP-extracted multimodal features, leading to more accurate and stylistically appropriate text from the GPT language model. Furthermore, our method achieves comparable or superior performance to traditional "encoder-decoder" baselines on the BLEU, CIDEr, and METEOR metrics while requiring only one-fifth of the training time. Experiments on an additional Chinese-English bilingual RSIC dataset further indicate that the CLIP-GPT framework, which builds on extensive multimodal pre-training, has strong potential for cross-lingual RSIC tasks.
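The data flow the abstract describes (an image feature mapped into the language model's embedding space and placed ahead of the caption tokens) can be illustrated with a minimal sketch. This is not the authors' mapping network: a plain random linear projection stands in for the improved Transformer with adaptive multi-head attention, and the names `make_linear`, `map_to_prefix`, `CLIP_DIM`, `GPT_DIM`, and `PREFIX_LEN`, as well as the toy dimensions, are hypothetical choices made only for illustration.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]

def map_to_prefix(clip_feature, weights, prefix_len, gpt_dim):
    """Project one image feature into prefix_len vectors of size gpt_dim.

    Assumption: a single linear layer replaces the paper's Transformer-based
    mapping network; only the input and output shapes are meant to match
    the idea described in the abstract.
    """
    flat = [sum(w * x for w, x in zip(row, clip_feature)) for row in weights]
    return [flat[i * gpt_dim:(i + 1) * gpt_dim] for i in range(prefix_len)]

# Toy sizes (hypothetical); real CLIP and GPT-2 embeddings are much larger.
CLIP_DIM, GPT_DIM, PREFIX_LEN = 8, 6, 4

rng = random.Random(0)
clip_feature = [rng.gauss(0, 1) for _ in range(CLIP_DIM)]   # stand-in for a CLIP image feature
weights = make_linear(CLIP_DIM, PREFIX_LEN * GPT_DIM, rng)  # stand-in for learned parameters
prefix = map_to_prefix(clip_feature, weights, PREFIX_LEN, GPT_DIM)

# Stand-in for the embeddings of three caption tokens.
caption_embeddings = [[0.0] * GPT_DIM for _ in range(3)]

# The language model would receive the prefix followed by the caption tokens.
decoder_input = prefix + caption_embeddings
print(len(decoder_input), len(decoder_input[0]))  # prints "7 6"
```

In a setup of this kind, only the mapping parameters would need to be trained when the image encoder is used zero-shot, which is one plausible reading of why the abstract calls the approach lightweight.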
Pages: 904-915
Page count: 12
Related papers
50 in total
  • [1] Word-Sentence Framework for Remote Sensing Image Captioning
    Wang, Qi
    Huang, Wei
    Zhang, Xueting
    Li, Xuelong
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2021, 59 (12): 10532-10543
  • [2] A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning
    Meng, Lingwu
    Wang, Jing
    Meng, Ran
    Yang, Yang
    Xiao, Liang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62: 1-15
  • [3] Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning
    Song, Zijie
    Hu, Zhenzhen
    Zhou, Yuanen
    Zhao, Ye
    Hong, Richang
    Wang, Meng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 9008-9020
  • [4] Truncation Cross Entropy Loss for Remote Sensing Image Captioning
    Li, Xuelong
    Zhang, Xueting
    Huang, Wei
    Wang, Qi
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2021, 59 (06): 5246-5257
  • [5] Interactive Concept Network Enhanced Transformer for Remote Sensing Image Captioning
    Zhang, Cheng
    Ren, Zhongle
    Hou, Biao
    Meng, Jianhua
    Li, Weibin
    Jiao, Licheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [6] Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer
    Zhuang, Shuo
    Wang, Ping
    Wang, Gang
    Wang, Di
    Chen, Jinyong
    Gao, Feng
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [7] GLCM: Global-Local Captioning Model for Remote Sensing Image Captioning
    Wang, Qi
    Huang, Wei
    Zhang, Xueting
    Li, Xuelong
    IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (11): 6910-6922
  • [8] Intertemporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning
    Li, Yunpeng
    Zhang, Xiangrong
    Cheng, Xina
    Chen, Puhua
    Jiao, Licheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [9] Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning
    Kandala, Hitesh
    Saha, Sudipan
    Banerjee, Biplab
    Zhu, Xiao Xiang
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [10] Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning
    Li, Yunpeng
    Zhang, Xiangrong
    Gu, Jing
    Li, Chen
    Wang, Xin
    Tang, Xu
    Jiao, Licheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60