EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering

Cited by: 0
Authors
Wang, Junjue [1 ]
Zheng, Zhuo [2 ]
Chen, Zihang [1 ]
Ma, Ailong [1 ]
Zhong, Yanfei [1 ]
Affiliations
[1] Wuhan Univ, LIESMARS, Wuhan 430074, Peoples R China
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Funding
National Natural Science Foundation of China;
Keywords
IMAGERY;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning. Based on city planning needs, we develop a multimodal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded. As objects are the basis for complex relational reasoning, we propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way. To preserve refined spatial locations and semantics, SOBA leverages a segmentation network for object semantics generation. The object-guided attention aggregates object interior features via pseudo masks, and bidirectional cross-attention further models object external relations hierarchically. To optimize object counting, we propose a numerical difference loss that dynamically adds difference penalties, unifying the classification and regression tasks. Experimental results show that SOBA outperforms both advanced general and remote sensing methods. We believe this dataset and framework provide a strong benchmark for Earth vision's complex analysis. The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA.
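The abstract only names the numerical difference loss without giving its form. As a hedged illustration of the idea it describes (a penalty that grows with the numeric gap between predicted and true counts, unifying classification and regression), one plausible sketch adds an expected |count − target| term to standard softmax cross-entropy. The function name, the `alpha` weight, and this exact formulation are assumptions for illustration, not the authors' implementation.

```python
import math

def numerical_difference_loss(logits, target, alpha=1.0):
    """Hypothetical sketch of a count loss: cross-entropy plus an
    expected numeric-distance penalty over the count classes.

    logits : raw scores, one per candidate count (0, 1, 2, ...)
    target : the true count (index into logits)
    alpha  : assumed weight balancing the two terms
    """
    # numerically stable softmax over the candidate counts
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    probs = [e / s for e in exps]

    # standard cross-entropy on the true count class
    ce = -math.log(probs[target])

    # expected |predicted count - true count| under the softmax:
    # being off by 3 is penalized more than being off by 1
    reg = sum(p * abs(k - target) for k, p in enumerate(probs))

    return ce + alpha * reg
```

Under this sketch, a confident prediction of the correct count incurs near-zero loss, while a confident prediction far from the true count is penalized by both terms.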
Pages: 5481-5489
Number of pages: 9
Related papers
50 records in total
  • [41] Learning Hierarchical Reasoning for Text-Based Visual Question Answering
    Li, Caiyuan
    Du, Qinyi
    Wang, Qingqing
    Jin, Yaohui
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT III, 2021, 12893 : 305 - 316
  • [42] Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery
    Bazi, Yakoub
    Al Rahhal, Mohamad Mahmoud
    Mekhalfi, Mohamed Lamine
    Al Zuair, Mansour Abdulaziz
    Melgani, Farid
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [43] LIT-4-RSVQA: LIGHTWEIGHT TRANSFORMER-BASED VISUAL QUESTION ANSWERING IN REMOTE SENSING
    Hackel, Leonard
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Begüm
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 2231 - 2234
  • [44] Enhancing scene-text visual question answering with relational reasoning, attention and dynamic vocabulary integration
    Agrawal, Mayank
    Jalal, Anand Singh
    Sharma, Himanshu
    COMPUTATIONAL INTELLIGENCE, 2024, 40 (01)
  • [45] Medical visual question answering based on question-type reasoning and semantic space constraint
    Wang, Meiling
    He, Xiaohai
    Liu, Luping
    Qing, Linbo
    Chen, Honggang
    Liu, Yan
    Ren, Chao
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2022, 131
  • [46] SEGMENTATION-GUIDED ATTENTION FOR VISUAL QUESTION ANSWERING FROM REMOTE SENSING IMAGES
    Tosato, Lucrezia
    Boussaid, Hichem
    Weissgerber, Flora
    Kurtz, Camille
    Wendling, Laurent
    Lobry, Sylvain
    IGARSS 2024-2024 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, IGARSS 2024, 2024, : 2750 - 2754
  • [47] VISUAL QUESTION ANSWERING IN REMOTE SENSING WITH CROSS-ATTENTION AND MULTIMODAL INFORMATION BOTTLENECK
    Songara, Jayesh
    Pande, Shivam
    Choudhury, Shabnam
    Banerjee, Biplab
    Velmurugan, Rajbabu
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 6278 - 6281
  • [48] Scale-guided Fusion Inference Network for Remote Sensing Visual Question Answering
    Zhao E.-Y.
    Song N.
    Nie J.
    Wang X.
    Zheng C.-Y.
    Wei Z.-Q.
    Ruan Jian Xue Bao/Journal of Software, 2024, 35 (05): : 2133 - 2149
  • [49] A multi-scale contextual attention network for remote sensing visual question answering
    Feng, Jiangfan
    Wang, Hui
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2024, 126
  • [50] EarthVQANet: Multi-task visual question answering for remote sensing image understanding
    Wang, Junjue
    Ma, Ailong
    Chen, Zihang
    Zheng, Zhuo
    Wan, Yuting
    Zhang, Liangpei
    Zhong, Yanfei
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2024, 212 : 422 - 439