Open-ended remote sensing visual question answering with transformers

Cited by: 9
Authors
Al Rahhal, Mohamad M. [1 ,5 ]
Bazi, Yakoub [2 ]
Alsaleh, Sara O. [2 ]
Al-Razgan, Muna [3 ]
Mekhalfi, Mohamed Lamine [4 ]
Al Zuair, Mansour [2 ]
Alajlan, Naif [2 ]
Affiliations
[1] King Saud Univ, Coll Appl Comp Sci, Appl Comp Sci Dept, Riyadh, Saudi Arabia
[2] King Saud Univ, Coll Comp & Informat Sci, Comp Engn Dept, Riyadh, Saudi Arabia
[3] King Saud Univ, Coll Comp & Informat Sci, Dept Software Engn, Riyadh, Saudi Arabia
[4] Fdn Bruno Kessler, Digital Ind Ctr, Technol Vis Unit, Trento, Italy
[5] King Saud Univ, Coll Appl Comp Sci, Appl Comp Sci Dept, PO Box 51178, Riyadh 11543, Saudi Arabia
Keywords
Visual question answering; remote sensing; open-set dataset; vision transformers; encoder-decoder architecture;
DOI
10.1080/01431161.2022.2145583
CLC Classification Number
TP7 [Remote Sensing Technology];
Discipline Codes
081102 ; 0816 ; 081602 ; 083002 ; 1404 ;
Abstract
Visual question answering (VQA) has recently been attracting attention in remote sensing. However, existing solutions remain rather limited in that current VQA datasets address closed-ended question-answer queries, which do not necessarily reflect real open-ended scenarios. In this paper, we propose a new dataset, named VQA-TextRS, built manually with human annotations and covering various forms of open-ended question-answer pairs. Moreover, we propose an encoder-decoder architecture based on transformers, whose self-attention mechanism allows relational learning across positions of the same sequence without the recurrence operations typical of earlier models. We employ vision and natural language processing (NLP) transformers to extract visual and textual cues from the image and the corresponding question, respectively. A transformer decoder then fuses the two modalities through its cross-attention mechanism, and the fused vectors drive the answer-generation process to produce the final output. We demonstrate that plausible results can be obtained in open-ended VQA; for instance, the proposed architecture achieves an accuracy of 84.01% on questions related to the presence of objects in the query images.
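The fusion step described in the abstract can be illustrated with a minimal NumPy sketch of decoder-style cross-attention: question-token embeddings (from the text encoder) act as queries attending over image-patch embeddings (from the vision encoder). All dimensions, the random projection matrices, and the function name `cross_attention` are illustrative placeholders, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats, d_k):
    """Decoder-style cross-attention: question tokens (queries)
    attend over image patch embeddings (keys/values)."""
    # In a real transformer these projections are learned parameters;
    # random matrices stand in for them here.
    rng = np.random.default_rng(0)
    W_q = rng.standard_normal((text_feats.shape[-1], d_k))
    W_k = rng.standard_normal((image_feats.shape[-1], d_k))
    W_v = rng.standard_normal((image_feats.shape[-1], d_k))
    Q = text_feats @ W_q           # (n_tokens, d_k)
    K = image_feats @ W_k          # (n_patches, d_k)
    V = image_feats @ W_v          # (n_patches, d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each token's attention over patches
    return weights @ V                  # fused vectors, (n_tokens, d_k)

# Toy shapes: 5 question tokens, 16 image patches, 32-dim embeddings.
text_feats = np.full((5, 32), 0.1)
image_feats = np.ones((16, 32))
fused = cross_attention(text_feats, image_feats, d_k=8)
print(fused.shape)  # (5, 8)
```

In the paper's architecture these fused vectors would then condition the decoder's answer generation; the sketch stops at the fusion output.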
Pages: 6809-6823
Page count: 15
Related Papers
50 records in total
  • [1] OPEN-ENDED VISUAL QUESTION ANSWERING MODEL FOR REMOTE SENSING IMAGES
    Alsaleh, Sara O.
    Bazi, Yakoub
    Al Rahhal, Mohamad M.
    Al Zuair, Mansour
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 2848 - 2851
  • [2] LANGUAGE TRANSFORMERS FOR REMOTE SENSING VISUAL QUESTION ANSWERING
    Chappuis, Christel
    Mendez, Vincent
    Walt, Eliot
    Lobry, Sylvain
    Le Saux, Bertrand
    Tuia, Devis
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 4855 - 4858
  • [3] Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation
    Xu, Yiming
    Chen, Lin
    Cheng, Zhongwei
    Duan, Lixin
    Luo, Jiebo
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 367 - 376
  • [4] Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
    Fu, Xingyu
    Zhang, Sheng
    Kwon, Gukyeong
    Perera, Pramuditha
    Zhu, Henghui
    Zhang, Yuhao
    Li, Alexander Hanbo
    Wang, William
    Wang, Zhiguo
    Castelli, Vittorio
    Ng, Patrick
    Roth, Dan
    Xiang, Bing
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 2333 - 2344
  • [5] Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models
    van Sonsbeek, Tom
    Derakhshani, Mohammad Mahdi
    Najdenkoska, Ivona
    Snoek, Cees G. M.
    Worring, Marcel
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT V, 2023, 14224 : 726 - 736
  • [6] Unifying the Video and Question Attentions for Open-Ended Video Question Answering
    Xue, Hongyang
    Zhao, Zhou
    Cai, Deng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2017, 26 (12) : 5656 - 5666
  • [7] The Open-Ended Question
    Chapman-Novakofski, Karen
    JOURNAL OF NUTRITION EDUCATION AND BEHAVIOR, 2011, 43 (03) : 141 - 141
  • [8] Knowledge-Constrained Answer Generation for Open-Ended Video Question Answering
    Jin, Yao
    Niu, Guocheng
    Xiao, Xinyan
    Zhang, Jian
    Peng, Xi
    Yu, Jun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 7, 2023, : 8141 - 8149
  • [9] Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
    Luo, Haozheng
    Qin, Ruiyang
    Xu, Chenwei
    Ye, Guo
    Luo, Zening
    2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, 2023, : 363 - 369
  • [10] Coarse to Fine Frame Selection for Online Open-ended Video Question Answering
    Nuthalapati, Sai Vidyaranya
    Tunga, Anirudh
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 353 - 361