A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

Cited by: 0
Authors
Sun, Dongwei [1 ,2 ]
Bao, Yajie [1 ,2 ]
Liu, Junmin [3 ]
Cao, Xiangyong [1 ,2 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Cyber Sci & Engn, Xian 710049, Shaanxi, Peoples R China
[2] Xi An Jiao Tong Univ, Minist Educ, Key Lab Intelligent Networks, Xian 710049, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Math & Stat, Dept Informat Sci, Xian 710049, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Feature extraction; Remote sensing; Kernel; Attention mechanisms; Accuracy; Sensors; Convolutional neural networks; Computational modeling; Visualization; Change captioning; remote sensing image change detection; sparse attention; transformer encoder;
DOI
10.1109/JSTARS.2024.3471625
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe the content differences between bitemporal remote sensing images. Recently, attention-based transformers have become a prevalent approach for capturing global change features. However, existing transformer-based RSICC methods face challenges such as large parameter counts and high computational complexity, both caused by the self-attention operation in the transformer encoder. To alleviate these issues, this article proposes a sparse focus transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components: a high-level feature extractor based on a convolutional neural network, a transformer encoder built on a sparse focus attention mechanism that locates and captures the changing regions in dual-temporal images, and a description decoder that embeds images and words to generate sentences describing the differences. By incorporating the sparse attention mechanism within the transformer encoder, the proposed SFT network reduces both the number of parameters and the computational complexity. Experimental results on various datasets demonstrate that, even with a reduction of over 90% in the parameters and computational complexity of the transformer encoder, the proposed network still achieves competitive performance compared to other state-of-the-art RSICC methods.
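The abstract does not spell out the exact sparsity pattern of the sparse focus attention, so the following PyTorch sketch only illustrates the general idea it relies on: restricting each query to a subset of keys so that attention cost scales with the window size rather than with the full pairwise interaction. The function name `local_window_attention`, the `window` parameter, and the banded mask are illustrative assumptions, not the paper's actual mechanism.

```python
import torch

def local_window_attention(q, k, v, window: int):
    """Banded sparse attention: query i attends only to keys j with |i - j| <= window.

    q, k, v: tensors of shape (batch, heads, seq_len, dim).
    NOTE: for clarity this masks a dense score matrix; a real sparse kernel
    would compute only the in-band scores to realize the FLOP savings.
    """
    b, h, n, d = q.shape
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (b, h, n, n)
    idx = torch.arange(n, device=q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window       # (n, n) band mask
    scores = scores.masked_fill(~band, float("-inf"))          # drop out-of-band pairs
    return torch.matmul(scores.softmax(dim=-1), v)

# Toy usage: tokens from a flattened 14x14 feature map of a bitemporal image pair.
q = k = v = torch.randn(1, 4, 196, 32)
out = local_window_attention(q, k, v, window=8)
print(out.shape)  # torch.Size([1, 4, 196, 32])
```

With a fixed window, each query effectively scores only O(window) keys instead of all 196, which is the kind of encoder-side reduction in parameters and computation that the abstract attributes to the sparse attention design.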
Pages: 18727-18738
Number of pages: 12