RSMoDM: Multimodal Momentum Distillation Model for Remote Sensing Visual Question Answering

Cited by: 0
Authors
Li, Pengfei [1 ]
Liu, Gang [1 ]
He, Jinlong [1 ]
Meng, Xiangxu [1 ]
Zhong, Shenjun [2 ,3 ]
Chen, Xun [4 ,5 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Monash Univ, Monash Biomed Imaging, Clayton, Vic 3800, Australia
[3] Natl Imaging Facil, Woolloongabba, Qld 4102, Australia
[4] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230027, Peoples R China
[5] Hefei Comprehens Natl Sci Ctr, Inst Dataspace, Hefei 230088, Peoples R China
Keywords
Feature extraction; Visualization; Task analysis; Transformers; Question answering (information retrieval); Computational modeling; Computer architecture; Cross-modal representation learning; domain shift; momentum distillation; remote sensing visual question answering (RS VQA);
DOI
10.1109/JSTARS.2024.3419035
CLC Classification Number
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Remote sensing (RS) visual question answering (VQA) is a task that answers questions about a given RS image by utilizing both image and textual information. However, existing methods in RS VQA overlook the fact that the ground truths in RS VQA benchmark datasets, which are algorithmically generated rather than manually annotated, may not always represent the most reasonable answers to the questions. In this article, we propose a multimodal momentum distillation model (RSMoDM) for RS VQA tasks. Specifically, during training we maintain a momentum-updated distillation model that generates stable and reliable pseudolabels for additional supervision, effectively preventing the model from being penalized for producing other reasonable outputs that differ from the ground truth. Additionally, to address domain shift in RS, we employ a Vision Transformer (ViT) trained on a large-scale RS dataset for enhanced image feature extraction. Moreover, we introduce a multimodal fusion module with cross-attention for improved cross-modal representation learning. Our extensive experiments across three different RS VQA datasets demonstrate that RSMoDM achieves state-of-the-art performance, particularly excelling in scenarios with limited training data. The strong interpretability of our method is further evidenced by visualized attention maps.
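The abstract only summarizes the momentum-distillation idea; the paper's exact formulation is not reproduced in this record. As a rough sketch of the common recipe it alludes to (an exponential-moving-average teacher whose soft outputs act as pseudolabels alongside the hard labels), the function names `ema_update` and `distilled_loss` and the hyperparameters `m` and `alpha` below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ema_update(teacher, student, m=0.995):
    """Momentum (EMA) update: the teacher slowly tracks the student."""
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}

def distilled_loss(student_logits, teacher_logits, onehot, alpha=0.4):
    """Blend hard-label cross-entropy with KL to the teacher's soft pseudolabels."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)  # pseudolabels from the momentum teacher
    ce = -(onehot * np.log(p + 1e-12)).sum(axis=-1).mean()
    kl = (q * (np.log(q + 1e-12) - np.log(p + 1e-12))).sum(axis=-1).mean()
    return (1.0 - alpha) * ce + alpha * kl
```

Because the teacher is an EMA of the student rather than a separately trained model, its pseudolabels stay stable across training steps, so plausible answers that differ from the generated ground truth still receive some positive supervision instead of being fully penalized.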
Pages: 16799 / 16814
Page count: 16
Related Papers
50 records total
  • [41] FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
    Singh, Shubhankar
    Chaurasia, Purvi
    Varun, Yerram
    Pandya, Pranshu
    Gupta, Vatsal
    Gupta, Vivek
    Roth, Dan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 1330 - 1350
  • [42] Dual-Key Multimodal Backdoors for Visual Question Answering
    Walmer, Matthew
    Sikka, Karan
    Sur, Indranil
    Shrivastava, Abhinav
    Jha, Susmit
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15354 - 15364
  • [43] Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
    Saqur, Raeid
    Narasimhan, Karthik
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [44] Contrastive training of a multimodal encoder for medical visual question answering
    Silva, Joao Daniel
    Martins, Bruno
    Magalhaes, Joao
    INTELLIGENT SYSTEMS WITH APPLICATIONS, 2023, 18
  • [45] Multimodal attention-driven visual question answering for Malayalam
    Kovath, A. G.
    Nayyar, A.
    Sikha, O. K.
    NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (24) : 14691 - 14708
  • [46] EduVQA: A multimodal Visual Question Answering framework for smart education
    Xiao, Jiongen
    Zhang, Zifeng
    ALEXANDRIA ENGINEERING JOURNAL, 2025, 122 : 615 - 624
  • [47] Visual Question Answering based on multimodal triplet knowledge accumulation
    Wang, Fengjuan
    An, Gaoyun
    2022 16TH IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP2022), VOL 1, 2022, : 81 - 84
  • [48] Dual-Key Multimodal Backdoors for Visual Question Answering
    Walmer, Matthew
    Sikka, Karan
    Sur, Indranil
    Shrivastava, Abhinav
    Jha, Susmit
    arXiv, 2021,
  • [49] A Video Question Answering Model Based on Knowledge Distillation
    Shao, Zhuang
    Wan, Jiahui
    Zong, Linlin
    INFORMATION, 2023, 14 (06)
  • [50] Improving Visual Question Answering by Multimodal Gate Fusion Network
    Xiang, Shenxiang
    Chen, Qiaohong
    Fang, Xian
    Guo, Menghao
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,