RSVQ-Diffusion Model for Text-to-Remote-Sensing Image Generation

Cited by: 1
Authors
Gao, Xin [1 ,2 ]
Fu, Yao [1 ]
Jiang, Xiaonan [1 ]
Wu, Fanlu [1 ]
Zhang, Yu [1 ]
Fu, Tianjiao [1 ]
Li, Chao [1 ]
Pei, Junyan [1 ]
Affiliations
[1] Chinese Acad Sci, Changchun Inst Opt, Fine Mech & Phys, Changchun 130033, Peoples R China
[2] Univ Chinese Acad Sci, Sch Optoelect, Beijing 100049, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2025, Vol. 15, Issue 03
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
remote sensing image; transformer; diffusion model; image generation; VQ-Diffusion; REPRESENTATIONS;
DOI
10.3390/app15031121
CLC Number
O6 [Chemistry];
Subject Classification Code
0703;
Abstract
Text-guided remote sensing image generation shows great potential in many practical applications, yet images generated by existing methods such as generative adversarial networks still face challenges, including low realism, regional distortions, and unclear details. Moreover, the inherent spatial complexity of remote sensing images and the limited scale of publicly available datasets make it particularly challenging to generate high-quality remote sensing images from text descriptions. To address these challenges, this paper proposes the RSVQ-Diffusion model for remote sensing image generation, achieving high-quality text-to-remote-sensing image generation applicable to target detection, simulation, and other fields. Specifically, this paper designs a spatial position encoding mechanism to integrate the spatial information of remote sensing images during model training. Additionally, the Transformer module is improved by incorporating a short-sequence local perception mechanism into the diffusion image decoder, addressing issues of unclear details and regional distortions in generated remote sensing images. Compared with the VQ-Diffusion model, the proposed model achieves significant improvements in the Fréchet Inception Distance (FID), the Inception Score (IS), and the text-image alignment (Contrastive Language-Image Pre-training, CLIP) score: the FID decreased from 96.68 to 90.36, the CLIP score increased from 26.92 to 27.22, and the IS increased from 7.11 to 7.24.
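This record does not include the authors' implementation. As a rough illustration of the kind of spatial position encoding the abstract describes, the following is a minimal PyTorch sketch that encodes each image token's row and column indices with sinusoidal features and concatenates them; the function names and the row/column split are illustrative assumptions, not the RSVQ-Diffusion authors' design.

```python
import torch

def sinusoidal_encoding(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal encoding of a 1-D integer index tensor."""
    freqs = torch.exp(
        -torch.log(torch.tensor(10000.0)) * torch.arange(0, dim, 2) / dim
    )
    angles = positions[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def spatial_position_encoding(height: int, width: int, dim: int) -> torch.Tensor:
    """Encode each (row, col) token position on an image-token grid:
    half the channels encode the row index, half the column index."""
    rows = torch.arange(height).repeat_interleave(width)  # row index per token
    cols = torch.arange(width).repeat(height)             # col index per token
    return torch.cat(
        [sinusoidal_encoding(rows, dim // 2),
         sinusoidal_encoding(cols, dim // 2)],
        dim=-1,
    )  # shape: (height * width, dim)

# Example: a 32x32 grid of discrete image tokens with 512-dim embeddings
pe = spatial_position_encoding(32, 32, 512)
print(pe.shape)  # torch.Size([1024, 512])
```

In a VQ-Diffusion-style pipeline, such an encoding would be added to the discrete image-token embeddings before the Transformer decoder so that each token retains its 2-D grid position rather than only its position in the flattened sequence.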
Pages: 19