RSMoDM: Multimodal Momentum Distillation Model for Remote Sensing Visual Question Answering

Cited by: 0
Authors
Li, Pengfei [1 ]
Liu, Gang [1 ]
He, Jinlong [1 ]
Meng, Xiangxu [1 ]
Zhong, Shenjun [2 ,3 ]
Chen, Xun [4 ,5 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Monash Univ, Monash Biomed Imaging, Clayton, Vic 3800, Australia
[3] Natl Imaging Facil, Woolloongabba, Qld 4102, Australia
[4] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230027, Peoples R China
[5] Hefei Comprehens Natl Sci Ctr, Inst Dataspace, Hefei 230088, Peoples R China
Keywords
Feature extraction; Visualization; Task analysis; Transformers; Question answering (information retrieval); Computational modeling; Computer architecture; Cross-modal representation learning; domain shift; momentum distillation; remote sensing visual question answering (RS VQA)
DOI
10.1109/JSTARS.2024.3419035
CLC Classification
TM (Electrical Engineering); TN (Electronic and Communication Technology)
Discipline Codes
0808; 0809
Abstract
Remote sensing (RS) visual question answering (VQA) is the task of answering questions about a given RS image by jointly exploiting visual and textual information. However, existing RS VQA methods overlook the fact that the ground truths in RS VQA benchmark datasets are generated algorithmically rather than annotated manually, and therefore may not always represent the most reasonable answers to the questions. In this article, we propose a multimodal momentum distillation model (RSMoDM) for RS VQA tasks. Specifically, during training we maintain a momentum-distilled (teacher) model that generates stable and reliable pseudolabels as additional supervision, which prevents the model from being penalized for producing reasonable answers that differ from the ground truth. Additionally, to address domain shift in RS, we employ a Vision Transformer (ViT) trained on a large-scale RS dataset for enhanced image feature extraction. Moreover, we introduce a multimodal fusion module with cross-attention for improved cross-modal representation learning. Extensive experiments on three RS VQA datasets demonstrate that RSMoDM achieves state-of-the-art performance, excelling particularly in scenarios with limited training data. Visualized attention maps further evidence the strong interpretability of our method.
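The abstract's central mechanism, a momentum (teacher) model that supplies soft pseudolabels alongside the hard ground truth, can be illustrated with a short sketch. The following is a minimal PyTorch rendering under stated assumptions: the `MomentumDistiller` wrapper, the 0.995 momentum coefficient, and the `alpha` loss weighting are hypothetical illustration choices, not details taken from the paper.

```python
# Minimal sketch of momentum distillation for VQA answer classification.
# Assumes `model(image, question)` returns answer logits; the class name,
# momentum value, and alpha weighting are illustrative assumptions.
import copy

import torch
import torch.nn.functional as F


class MomentumDistiller:
    def __init__(self, model: torch.nn.Module,
                 momentum: float = 0.995, alpha: float = 0.4):
        self.model = model
        # The momentum (teacher) model is an exponential moving average
        # (EMA) of the student and is never updated by gradients.
        self.teacher = copy.deepcopy(model)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.momentum = momentum
        self.alpha = alpha  # weight of the soft pseudolabel term

    @torch.no_grad()
    def _update_teacher(self) -> None:
        # EMA update: teacher <- m * teacher + (1 - m) * student.
        for p_s, p_t in zip(self.model.parameters(),
                            self.teacher.parameters()):
            p_t.mul_(self.momentum).add_(p_s, alpha=1.0 - self.momentum)

    def loss(self, image: torch.Tensor, question: torch.Tensor,
             answer_ids: torch.Tensor) -> torch.Tensor:
        logits = self.model(image, question)  # student prediction
        with torch.no_grad():
            self._update_teacher()
            # Soft pseudolabels from the slowly moving teacher.
            soft_targets = F.softmax(self.teacher(image, question), dim=-1)
        # Hard loss against the (possibly noisy, auto-generated) ground truth...
        hard = F.cross_entropy(logits, answer_ids)
        # ...plus a KL term toward the teacher's distribution, so the student
        # is not fully penalized for other plausible answers.
        soft = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets,
                        reduction="batchmean")
        return (1.0 - self.alpha) * hard + self.alpha * soft
```

Because the teacher moves only by EMA, its pseudolabels change slowly and smoothly across training steps, which is what makes them a stable second supervision signal when the dataset's auto-generated answers are imperfect.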
Pages: 16799-16814
Number of pages: 16
Related Papers (50 in total)
  • [31] Zhao, E.-Y.; Song, N.; Nie, J.; Wang, X.; Zheng, C.-Y.; Wei, Z.-Q. Scale-guided Fusion Inference Network for Remote Sensing Visual Question Answering. Ruan Jian Xue Bao/Journal of Software, 2024, 35(05): 2133-2149.
  • [32] Feng, Jiangfan; Wang, Hui. A multi-scale contextual attention network for remote sensing visual question answering. International Journal of Applied Earth Observation and Geoinformation, 2024, 126.
  • [33] Mahamoud, Ibrahim Souleiman; Coustaty, Mickael; Joseph, Aurelie; d'Andecy, Vincent Poulain; Ogier, Jean-Marc. QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document. Document Analysis Systems, DAS 2022, 2022, 13237: 659-673.
  • [34] Wang, Junjue; Ma, Ailong; Chen, Zihang; Zheng, Zhuo; Wan, Yuting; Zhang, Liangpei; Zhong, Yanfei. EarthVQANet: Multi-task visual question answering for remote sensing image understanding. ISPRS Journal of Photogrammetry and Remote Sensing, 2024, 212: 422-439.
  • [35] Tran, Khiem Vinh; Phan, Hao Phu; Van Nguyen, Kiet; Nguyen, Ngan Luu Thuy. ViCLEVR: a visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese. Multimedia Systems, 2024, 30(04).
  • [36] Long, Shaopei; Li, Yong; Weng, Heng; Tang, Buzhou; Wang, Fu Lee; Hao, Tianyong. Trilinear Distillation Learning and Question Feature Capturing for Medical Visual Question Answering. Neural Computing for Advanced Applications, NCAA 2024, Pt. III, 2025, 2183: 162-177.
  • [37] Pan, Haiwei; He, Shuning; Zhang, Kejia; Qu, Bo; Chen, Chunling; Shi, Kun. AMAM: An Attention-based Multimodal Alignment Model for Medical Visual Question Answering. Knowledge-Based Systems, 2022, 255.
  • [38] Li, Haiyan; Han, Dezhi. Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering. Computer Science and Information Systems, 2021, 18(03): 1023-1040.
  • [39] Mudgal, Anjali; Kush, Udbhav; Kumar, Aditya; Jafari, Amir. Multimodal fusion: advancing medical visual question-answering. Neural Computing and Applications, 2024, 36(33): 20949-20962.
  • [40] Lao, Mingrui; Guo, Yanming; Wang, Hui; Zhang, Xin. Multimodal Local Perception Bilinear Pooling for Visual Question Answering. IEEE Access, 2018, 6: 57923-57932.