A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

Cited by: 0
Authors
Sun, Dongwei [1 ,2 ]
Bao, Yajie [1 ,2 ]
Liu, Junmin [3 ]
Cao, Xiangyong [1 ,2 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Cyber Sci & Engn, Xian 710049, Shaanxi, Peoples R China
[2] Xi An Jiao Tong Univ, Minist Educ, Key Lab Intelligent Networks, Xian 710049, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Math & Stat, Dept Informat Sci, Xian 710049, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Feature extraction; Remote sensing; Kernel; Attention mechanisms; Accuracy; Sensors; Convolutional neural networks; Computational modeling; Visualization; Change captioning; remote sensing image change detection; sparse attention; transformer encoder;
DOI
10.1109/JSTARS.2024.3471625
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe the content differences between bitemporal remote sensing images. Recently, attention-based transformers have become a prevalent approach for capturing global change features. However, existing transformer-based RSICC methods face challenges such as large parameter counts and high computational complexity, both caused by the self-attention operation in the transformer encoder. To alleviate these issues, this article proposes a sparse focus transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components: a high-level feature extractor based on a convolutional neural network, a transformer encoder built on a sparse focus attention mechanism that locates and captures the changing regions in dual-temporal images, and a description decoder that embeds images and words to generate sentences describing the differences. By incorporating the sparse attention mechanism within the transformer encoder, the proposed SFT network reduces both the number of parameters and the computational complexity. Experimental results on various datasets demonstrate that, even with a reduction of over 90% in the parameters and computational complexity of the transformer encoder, the proposed network still achieves competitive performance compared to other state-of-the-art RSICC methods.
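The abstract does not spell out the exact sparsity pattern of the sparse focus attention, so the following PyTorch sketch only illustrates the general idea it relies on: restricting each query to a subset of keys so that attention cost scales with the window size rather than with the full pairwise interaction. The function name `local_window_attention`, the `window` parameter, and the banded mask are illustrative assumptions, not the paper's actual mechanism.

```python
import torch

def local_window_attention(q, k, v, window: int):
    """Banded sparse attention: query i attends only to keys j with |i - j| <= window.

    q, k, v: tensors of shape (batch, heads, seq_len, dim).
    NOTE: for clarity this masks a dense score matrix; a real sparse kernel
    would compute only the in-band scores to realize the FLOP savings.
    """
    b, h, n, d = q.shape
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (b, h, n, n)
    idx = torch.arange(n, device=q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window       # (n, n) band mask
    scores = scores.masked_fill(~band, float("-inf"))          # drop out-of-band pairs
    return torch.matmul(scores.softmax(dim=-1), v)

# Toy usage: tokens from a flattened 14x14 feature map of a bitemporal image pair.
q = k = v = torch.randn(1, 4, 196, 32)
out = local_window_attention(q, k, v, window=8)
print(out.shape)  # torch.Size([1, 4, 196, 32])
```

With a fixed window, each query effectively scores only O(window) keys instead of all 196, which is the kind of encoder-side reduction in parameters and computation that the abstract attributes to the sparse attention design.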
Pages: 18727-18738
Number of pages: 12