Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning

被引:0
|
作者
Liu, Anli [1 ]
Meng, Lingwu [1 ]
Xiao, Liang [1 ,2 ]
机构
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Nanjing Univ Sci & Technol, Minist Educ, Key Lab Intelligent Percept & Syst High Dimens Inf, Nanjing 210094, Peoples R China
基金
美国国家科学基金会;
关键词
Image captioning; remote sensing; transformer; visual position encoding; NETWORK;
D O I
10.1109/JSTARS.2024.3487846
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
Remote sensing image captioning (RSIC) is a crucial task in interpreting remote sensing images (RSIs), as it involves describing their content using clear and precise natural language. However, the RSIC encounters difficulties due to the intricate structure and distinctive features of the images, such as the issue of rotational ambiguity. The existence of visually alike objects or areas can result in misidentification. In addition, prioritizing groups of objects with strong relational ties during the captioning process poses a significant challenge. To address these challenges, we propose the visual rotated position encoding transformer for RSIC. First of all, rotation-invariant features and global features are extracted using a multilevel feature extraction (MFE) module. To focus on closely related rotated objects, we design a visual rotated position encoding module, which is incorporated into the transformer encoder to model directional relationships between objects. To distinguish similar features and guide caption generation, we propose a feature enhancement fusion module consisting of feature enhancement and feature fusion. The feature enhancement component adopts a self-attention mechanism to construct fully connected graphs for object features. The feature fusion component integrates global features and word vectors to guide the caption generation process. In addition, we construct an RSI rotated object detection dataset RSIC-ROD and pretrain a rotated object detector. The proposed method demonstrates significant performance improvements on four datasets, showcasing enhanced capabilities in preserving descriptive details, distinguishing similar objects, and accurately capturing object relationships.
引用
收藏
页码:20026 / 20040
页数:15
相关论文
共 50 条
  • [1] Cooperative Connection Transformer for Remote Sensing Image Captioning
    Zhao, Kai
    Xiong, Wei
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 14
  • [2] Image Captioning Based on An Improved Transformer with IoU Position Encoding
    Li, Yazhou
    Shi, Yihui
    Liu, Yun
    Li, Ruifan
    Ma, Zhanyu
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 2066 - 2071
  • [3] Region-guided transformer for remote sensing image captioning
    Zhao, Kai
    Xiong, Wei
    INTERNATIONAL JOURNAL OF DIGITAL EARTH, 2024, 17 (01)
  • [4] Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning
    Kandala, Hitesh
    Saha, Sudipan
    Banerjee, Biplab
    Zhu, Xiao Xiang
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [5] Prior Knowledge-Guided Transformer for Remote Sensing Image Captioning
    Meng, Lingwu
    Wang, Jing
    Yang, Yang
    Xiao, Liang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61 : 1 - 13
  • [6] Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer
    Liu, Chenyang
    Zhao, Rui
    Shi, Zhenwei
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [7] A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning
    Meng, Lingwu
    Wang, Jing
    Meng, Ran
    Yang, Yang
    Xiao, Liang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 15
  • [8] Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer
    Zhuang, Shuo
    Wang, Ping
    Wang, Gang
    Wang, Di
    Chen, Jinyong
    Gao, Feng
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [9] Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer
    Liu, Chenyang
    Zhao, Rui
    Shi, Zhenwei
    IEEE Geoscience and Remote Sensing Letters, 2022, 19
  • [10] A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning
    Sun, Dongwei
    Bao, Yajie
    Liu, Junmin
    Cao, Xiangyong
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 18727 - 18738