RSMoDM: Multimodal Momentum Distillation Model for Remote Sensing Visual Question Answering

Cited by: 0
Authors
Li, Pengfei [1 ]
Liu, Gang [1 ]
He, Jinlong [1 ]
Meng, Xiangxu [1 ]
Zhong, Shenjun [2 ,3 ]
Chen, Xun [4 ,5 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Monash Univ, Monash Biomed Imaging, Clayton, Vic 3800, Australia
[3] Natl Imaging Facil, Woolloongabba, Qld 4102, Australia
[4] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230027, Peoples R China
[5] Hefei Comprehens Natl Sci Ctr, Inst Dataspace, Hefei 230088, Peoples R China
Keywords
Feature extraction; Visualization; Task analysis; Transformers; Question answering (information retrieval); Computational modeling; Computer architecture; Cross-modal representation learning; domain shift; momentum distillation; remote sensing visual question answering (RS VQA)
DOI
10.1109/JSTARS.2024.3419035
CLC Classification Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Classification Code
0808; 0809
Abstract
Remote sensing (RS) visual question answering (VQA) is a task that answers questions about a given RS image by utilizing both image and textual information. However, existing RS VQA methods overlook the fact that the ground truths in RS VQA benchmark datasets, which are algorithmically generated rather than manually annotated, may not always represent the most reasonable answers to the questions. In this article, we propose a multimodal momentum distillation model (RSMoDM) for RS VQA tasks. Specifically, we maintain a momentum version of the model during training that generates stable and reliable pseudolabels for additional supervision, which prevents the model from being penalized for producing other reasonable answers that differ from the ground truth. Additionally, to address domain shift in RS, we employ a Vision Transformer (ViT) trained on a large-scale RS dataset for enhanced image feature extraction. Moreover, we introduce a multimodal fusion module with cross-attention for improved cross-modal representation learning. Our extensive experiments across three different RS VQA datasets demonstrate that RSMoDM achieves state-of-the-art performance, particularly excelling in scenarios with limited training data. The strong interpretability of our method is further evidenced by visualized attention maps.
Pages: 16799 - 16814
Page count: 16
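
To make the mechanisms named in the abstract concrete, the following is a minimal PyTorch sketch of (i) a momentum (EMA) copy of the model that supplies soft pseudolabels alongside the hard ground truth, and (ii) a cross-attention fusion layer in which question tokens attend over image tokens. This is an illustration under stated assumptions, not the authors' implementation: all names and hyperparameters (momentum 0.995, loss weight alpha, temperature tau, embedding width 768) are hypothetical.

    import copy
    import torch
    import torch.nn.functional as F

    def build_momentum_model(student: torch.nn.Module) -> torch.nn.Module:
        # Frozen deep copy of the student; updated only by EMA, never by gradients.
        teacher = copy.deepcopy(student)
        for p in teacher.parameters():
            p.requires_grad_(False)
        return teacher

    @torch.no_grad()
    def ema_update(student, teacher, momentum=0.995):
        # teacher <- momentum * teacher + (1 - momentum) * student, once per step.
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

    def momentum_distillation_loss(student_logits, teacher_logits, labels,
                                   alpha=0.4, tau=1.0):
        # Hard-label cross-entropy against the (algorithmically generated, hence
        # possibly imperfect) ground truth, blended with a KL term toward the
        # momentum model's soft pseudolabels, so that other plausible answers
        # are not fully penalized.
        ce = F.cross_entropy(student_logits, labels)
        soft = F.softmax(teacher_logits.detach() / tau, dim=-1)
        kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                      soft, reduction="batchmean") * tau ** 2
        return (1.0 - alpha) * ce + alpha * kd

    class CrossAttentionFusion(torch.nn.Module):
        # Cross-modal fusion: question tokens (queries) attend over ViT image
        # tokens (keys/values), with a residual connection and layer norm.
        def __init__(self, dim=768, num_heads=8):
            super().__init__()
            self.attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = torch.nn.LayerNorm(dim)

        def forward(self, text_tokens, image_tokens):
            fused, _ = self.attn(text_tokens, image_tokens, image_tokens)
            return self.norm(text_tokens + fused)

In a training loop one would call ema_update after each optimizer step and feed the same batch through the teacher under torch.no_grad() to obtain teacher_logits; the weight alpha trades off trust in the generated ground truth against the momentum model's pseudolabels.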