Open-ended remote sensing visual question answering with transformers

Cited by: 9
Authors
Al Rahhal, Mohamad M. [1 ,5 ]
Bazi, Yakoub [2 ]
Alsaleh, Sara O. [2 ]
Al-Razgan, Muna [3 ]
Mekhalfi, Mohamed Lamine [4 ]
Al Zuair, Mansour [2 ]
Alajlan, Naif [2 ]
Affiliations
[1] King Saud Univ, Coll Appl Comp Sci, Appl Comp Sci Dept, Riyadh, Saudi Arabia
[2] King Saud Univ, Coll Comp & Informat Sci, Comp Engn Dept, Riyadh, Saudi Arabia
[3] King Saud Univ, Coll Comp & Informat Sci, Dept Software Engn, Riyadh, Saudi Arabia
[4] Fdn Bruno Kessler, Digital Ind Ctr, Technol Vis Unit, Trento, Italy
[5] King Saud Univ, Coll Appl Comp Sci, Appl Comp Sci Dept, PO Box 51178, Riyadh 11543, Saudi Arabia
Keywords
Visual question answering; remote sensing; open-set dataset; vision transformers; encoder-decoder architecture;
DOI
10.1080/01431161.2022.2145583
CLC Classification Number
TP7 [Remote Sensing Technology];
Discipline Codes
081102 ; 0816 ; 081602 ; 083002 ; 1404 ;
Abstract
Visual question answering (VQA) has recently been attracting attention in remote sensing. However, existing solutions remain rather limited in that current VQA datasets address closed-ended question-answer queries, which do not necessarily reflect real open-ended scenarios. In this paper, we propose a new dataset, named VQA-TextRS, built manually with human annotations and covering various forms of open-ended question-answer pairs. Moreover, we propose an encoder-decoder architecture based on transformers, whose self-attention mechanism allows relational learning across positions of the same sequence without the recurrence operations typical of earlier models. We employ vision and natural language processing (NLP) transformers to extract visual and textual cues from the image and the corresponding question, respectively. A transformer decoder then fuses the two modalities through its cross-attention mechanism, and the fused vectors drive the answer-generation process to produce the final output. We demonstrate that plausible results can be obtained in open-ended VQA; for instance, the proposed architecture achieves an accuracy of 84.01% on questions related to the presence of objects in the query images.
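The fusion step described in the abstract can be illustrated with a minimal NumPy sketch of decoder-style cross-attention: question-token embeddings (from the text encoder) act as queries attending over image-patch embeddings (from the vision encoder). All dimensions, the random projection matrices, and the function name `cross_attention` are illustrative placeholders, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats, d_k):
    """Decoder-style cross-attention: question tokens (queries)
    attend over image patch embeddings (keys/values)."""
    # In a real transformer these projections are learned parameters;
    # random matrices stand in for them here.
    rng = np.random.default_rng(0)
    W_q = rng.standard_normal((text_feats.shape[-1], d_k))
    W_k = rng.standard_normal((image_feats.shape[-1], d_k))
    W_v = rng.standard_normal((image_feats.shape[-1], d_k))
    Q = text_feats @ W_q           # (n_tokens, d_k)
    K = image_feats @ W_k          # (n_patches, d_k)
    V = image_feats @ W_v          # (n_patches, d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each token's attention over patches
    return weights @ V                  # fused vectors, (n_tokens, d_k)

# Toy shapes: 5 question tokens, 16 image patches, 32-dim embeddings.
text_feats = np.full((5, 32), 0.1)
image_feats = np.ones((16, 32))
fused = cross_attention(text_feats, image_feats, d_k=8)
print(fused.shape)  # (5, 8)
```

In the paper's architecture these fused vectors would then condition the decoder's answer generation; the sketch stops at the fusion output.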
Pages: 6809-6823
Page count: 15
Related Papers
50 records in total
  • [1] OPEN-ENDED VISUAL QUESTION ANSWERING MODEL FOR REMOTE SENSING IMAGES
    Alsaleh, Sara O.
    Bazi, Yakoub
    Al Rahhal, Mohamad M.
    Al Zuair, Mansour
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 2848 - 2851
  • [2] LANGUAGE TRANSFORMERS FOR REMOTE SENSING VISUAL QUESTION ANSWERING
    Chappuis, Christel
    Mendez, Vincent
    Walt, Eliot
    Lobry, Sylvain
    Le Saux, Bertrand
    Tuia, Devis
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 4855 - 4858
  • [3] Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation
    Xu, Yiming
    Chen, Lin
    Cheng, Zhongwei
    Duan, Lixin
    Luo, Jiebo
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 367 - 376
  • [4] Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
    Fu, Xingyu
    Zhang, Sheng
    Kwon, Gukyeong
    Perera, Pramuditha
    Zhu, Henghui
    Zhang, Yuhao
    Li, Alexander Hanbo
    Wang, William
    Wang, Zhiguo
    Castelli, Vittorio
    Ng, Patrick
    Roth, Dan
    Xiang, Bing
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 2333 - 2344
  • [5] Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models
    van Sonsbeek, Tom
    Derakhshani, Mohammad Mahdi
    Najdenkoska, Ivona
    Snoek, Cees G. M.
    Worring, Marcel
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT V, 2023, 14224 : 726 - 736
  • [6] Unifying the Video and Question Attentions for Open-Ended Video Question Answering
    Xue, Hongyang
    Zhao, Zhou
    Cai, Deng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2017, 26 (12) : 5656 - 5666
  • [7] The Open-Ended Question
    Chapman-Novakofski, Karen
    JOURNAL OF NUTRITION EDUCATION AND BEHAVIOR, 2011, 43 (03) : 141 - 141
  • [8] Knowledge-Constrained Answer Generation for Open-Ended Video Question Answering
    Jin, Yao
    Niu, Guocheng
    Xiao, Xinyan
    Zhang, Jian
    Peng, Xi
    Yu, Jun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 7, 2023, : 8141 - 8149
  • [9] Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
    Luo, Haozheng
    Qin, Ruiyang
    Xu, Chenwei
    Ye, Guo
    Luo, Zening
    2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, 2023, : 363 - 369
  • [10] Coarse to Fine Frame Selection for Online Open-ended Video Question Answering
    Nuthalapati, Sai Vidyaranya
    Tunga, Anirudh
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 353 - 361