RSMoDM: Multimodal Momentum Distillation Model for Remote Sensing Visual Question Answering

Cited by: 0
Authors
Li, Pengfei [1 ]
Liu, Gang [1 ]
He, Jinlong [1 ]
Meng, Xiangxu [1 ]
Zhong, Shenjun [2 ,3 ]
Chen, Xun [4 ,5 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Monash Univ, Monash Biomed Imaging, Clayton, Vic 3800, Australia
[3] Natl Imaging Facil, Woolloongabba, Qld 4102, Australia
[4] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230027, Peoples R China
[5] Hefei Comprehens Natl Sci Ctr, Inst Dataspace, Hefei 230088, Peoples R China
Keywords
Feature extraction; Visualization; Task analysis; Transformers; Question answering (information retrieval); Computational modeling; Computer architecture; Cross-modal representation learning; domain shift; momentum distillation; remote sensing visual question answering (RS VQA);
DOI
10.1109/JSTARS.2024.3419035
CLC (Chinese Library Classification)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Remote sensing (RS) visual question answering (VQA) is a task that answers questions about a given RS image by utilizing both image and textual information. However, existing methods in RS VQA overlook the fact that the ground truths in RS VQA benchmark datasets, which are algorithmically generated rather than manually annotated, may not always represent the most reasonable answers to the questions. In this article, we propose a multimodal momentum distillation model (RSMoDM) for RS VQA tasks. Specifically, we maintain a momentum-distilled version of the model during training that generates stable and reliable pseudolabels for additional supervision, effectively preventing the model from being penalized for producing other reasonable outputs that differ from the ground truth. Additionally, to address domain shift in RS, we employ a Vision Transformer (ViT) trained on a large-scale RS dataset for enhanced image feature extraction. Moreover, we introduce a multimodal fusion module with cross-attention for improved cross-modal representation learning. Our extensive experiments across three different RS VQA datasets demonstrate that RSMoDM achieves state-of-the-art performance, particularly excelling in scenarios with limited training data. The strong interpretability of our method is further evidenced by visualized attention maps.
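The momentum-distillation idea the abstract describes (an exponential-moving-average copy of the model producing soft pseudolabels that supplement the hard ground truth) can be sketched as follows. This is a minimal NumPy illustration, not the authors' released code: the function names (`ema_update`, `distillation_loss`), the momentum value, and the mixing weight `alpha` are assumptions for the sake of the example.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.995):
    """Momentum (EMA) update: the teacher's parameters drift slowly
    toward the student's, yielding a stable pseudolabel source."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.4):
    """Blend hard-label cross-entropy with a KL term pulling the student
    toward the teacher's soft pseudolabels (treated as fixed targets)."""
    p = softmax(student_logits)           # student answer distribution
    q = softmax(teacher_logits)           # teacher pseudolabels (no gradient)
    n = student_logits.shape[0]
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    kl = (q * (np.log(q + 1e-12) - np.log(p + 1e-12))).sum(axis=-1).mean()
    return (1.0 - alpha) * ce + alpha * kl
```

Because the teacher is an EMA of past students, its pseudolabels change slowly, so a student answer that is plausible but differs from the algorithmically generated ground truth is penalized less than it would be under cross-entropy alone.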
Pages: 16799-16814
Page count: 16