Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval

Cited by: 6

Authors:

Al Rahhal, Mohamad M. [1]

Bencherif, Mohamed Abdelkader [2]

Bazi, Yakoub [3]

Alharbi, Abdullah [4]

Mekhalfi, Mohamed Lamine [5]

Affiliations:

[1] King Saud Univ, Coll Appl Comp Sci, Appl Comp Sci Dept, Riyadh 11543, Saudi Arabia

[2] King Saud Univ, Coll Comp & Informat Sci, Ctr Smart Robot Res, Riyadh 11543, Saudi Arabia

[3] King Saud Univ, Coll Comp & Informat Sci, Comp Engn Dept, Riyadh 11543, Saudi Arabia

[4] King Saud Univ, Commun Coll, Dept Comp Sci, Riyadh 11437, Saudi Arabia

[5] Fdn Bruno Kessler, Digital Ind Ctr, Technol Vis Unit, I-38123 Trento, Italy

Source:

APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 01

Keywords:

remote sensing; cross-modal retrieval; vision and language transformers; contrastive loss; ATTENTION;

DOI:

10.3390/app13010282

Chinese Library Classification number:

O6 [Chemistry];

Discipline classification code:

0703 ;

Abstract:

Remote sensing technology has advanced rapidly in recent years. Owing to the deployment of quantitative and qualitative sensors and the evolution of powerful hardware and software platforms, it powers a wide range of civilian and military applications. This in turn makes large volumes of data available for a broad range of applications, such as monitoring climate change. Yet processing, retrieving, and mining such large data volumes is challenging. Content-based remote sensing (RS) image retrieval approaches usually rely on a query image to retrieve relevant images from the dataset. To increase the flexibility of the retrieval experience, cross-modal representations based on text-image pairs are gaining popularity. Indeed, combining the text and image domains is regarded as one of the next frontiers in RS image retrieval. However, aligning text with the content of RS images is particularly challenging because of the visual-semantic discrepancy between the language and vision worlds. In this work, we propose different architectures based on vision and language transformers for text-to-image and image-to-text retrieval. Extensive experimental results on four different datasets, namely TextRS, Merced, Sydney, and RSICD, are reported and discussed.

References

Pages: 14

34 entries in total

[1] TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images [J]. Abdullah, Taghreed; Bazi, Yakoub; Al Rahhal, Mohamad M.; Mekhalfi, Mohamed L.; Rangarajan, Lalitha; Zuair, Mansour. REMOTE SENSING, 2020, 12 (03)

[2] UAV Image Multi-Labeling with Data-Efficient Transformers [J]. Bashmal, Laila; Bazi, Yakoub; Al Rahhal, Mohamad Mahmoud; Alhichri, Haikel; Al Ajlan, Naif. APPLIED SCIENCES-BASEL, 2021, 11 (09)

[3] Vision Transformers for Remote Sensing Image Classification [J]. Bazi, Yakoub; Bashmal, Laila; Rahhal, Mohamad M. Al; Dayil, Reham Al; Ajlan, Naif Al. REMOTE SENSING, 2021, 13 (03): 1-20

[4] Brown TB, 2020, ADV NEUR IN, V33

[5] Emerging Properties in Self-Supervised Vision Transformers [J]. Caron, Mathilde; Touvron, Hugo; Misra, Ishan; Jegou, Herve; Mairal, Julien; Bojanowski, Piotr; Joulin, Armand. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 9630-9640

[6] Remote Sensing Image Change Detection With Transformers [J]. Chen, Hao; Qi, Zipeng; Shi, Zhenwei. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60

[7] Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities [J]. Cheng, Gong; Xie, Xingxing; Han, Junwei; Guo, Lei; Xia, Gui-Song. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2020, 13: 3735-3756

[8] Remote Sensing Image Scene Classification: Benchmark and State of the Art [J]. Cheng, Gong; Han, Junwei; Lu, Xiaoqiang. PROCEEDINGS OF THE IEEE, 2017, 105 (10): 1865-1883

[9] NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning [J]. Cheng, Qimin; Huang, Haiyan; Xu, Yuan; Zhou, Yuzhuo; Li, Huanying; Wang, Zhongyuan. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60

[10] Multi-Attention Fusion and Fine-Grained Alignment for Bidirectional Image-Sentence Retrieval in Remote Sensing [J]. Cheng, Qimin; Zhou, Yuzhuo; Huang, Haiyan; Wang, Zhongyuan. IEEE-CAA JOURNAL OF AUTOMATICA SINICA, 2022, 9 (08): 1532-1535
