Visual Experience-Based Question Answering with Complex Multimodal Environments

Cited by: 0
Authors: Kim, Incheol [1]
Affiliation: [1] Kyonggi Univ, Dept Comp Sci, Suwon 16227, South Korea
DOI: 10.1155/2020/8567271
CLC number: T [Industrial Technology]
Discipline code: 08
Abstract: This paper proposes a novel visual experience-based question answering (VEQA) problem and a corresponding dataset for embodied intelligence research, which require an agent to perform actions, understand 3D scenes from successive partial input images, and answer natural language questions about its visual experiences in real time. Unlike conventional visual question answering (VQA), the VEQA problem assumes both partial observability and the dynamics of a complex multimodal environment. To address this problem, we propose a hybrid visual question answering system, VQAS, which integrates a deep neural network-based scene graph generation model with a rule-based knowledge reasoning system. The proposed system can generate accurate scene graphs for dynamic environments even under uncertainty. Moreover, it can answer complex questions through knowledge reasoning over rich background knowledge. Experiments using a photo-realistic 3D simulated environment, AI2-THOR, and the VEQA benchmark dataset demonstrate the high performance of the proposed system.
Pages: 18
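
The abstract describes a hybrid architecture: a neural scene graph generator turns each partial observation into relational facts, and a rule-based reasoner answers questions over the facts accumulated from successive views. The following minimal Python sketch illustrates that pipeline under stated assumptions; every name in it (Triple, SceneGraphModel, ExperienceMemory, answer) is an illustrative placeholder, not the authors' actual VQAS implementation or the AI2-THOR API.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

class SceneGraphModel:
    """Stand-in for the neural scene graph generator (placeholder)."""
    def predict(self, frame):
        # A real model would detect objects and spatial relations in the
        # RGB frame; canned triples keep the sketch self-contained.
        return {Triple("apple", "on", "table"), Triple("knife", "in", "drawer")}

@dataclass
class ExperienceMemory:
    """Accumulates facts from successive partial observations."""
    facts: set = field(default_factory=set)

    def update(self, triples):
        # A real system would also retract facts invalidated by change.
        self.facts |= triples

def answer(memory, question):
    """Toy rule-based reasoner for 'where is X?' questions."""
    kind, target = question
    if kind == "where":
        for t in memory.facts:
            if t.subject == target:
                return f"{t.relation} the {t.obj}"
    return "unknown"

if __name__ == "__main__":
    model, memory = SceneGraphModel(), ExperienceMemory()
    for frame in (None, None):          # two successive partial views
        memory.update(model.predict(frame))
    print(answer(memory, ("where", "apple")))   # -> on the table

In the paper's actual system, the rule-based component reasons over a much richer knowledge base with background knowledge, which is what lets it handle complex questions rather than the simple lookup shown here.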
Related papers (50 in total):
  • [31] Question Answering for Visual Navigation in Human-Centered Environments
    Kirilenko, Daniil E.
    Kovalev, Alexey K.
    Osipov, Evgeny
    Panov, Aleksandr I.
    ADVANCES IN SOFT COMPUTING (MICAI 2021), PT II, 2021, 13068 : 31 - 45
  • [32] Multimodal Encoder-Decoder Attention Networks for Visual Question Answering
    Chen, Chongqing
    Han, Dezhi
    Wang, Jun
    IEEE ACCESS, 2020, 8 : 35662 - 35671
  • [33] HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language
    Parida, Shantipriya
    Abdulmumin, Idris
    Muhammad, Shamsuddeen Hassan
    Bose, Aneesh
    Kohli, Guneet Singh
    Ahmad, Ibrahim Said
    Kotwal, Ketan
    Sarkar, Sayan Deb
    Bojar, Ondrej
    Kakudi, Habeebah Adamu
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 10162 - 10183
  • [34] Multimodal feature fusion by relational reasoning and attention for visual question answering
    Zhang, Weifeng
    Yu, Jing
    Hu, Hua
    Hu, Haiyang
    Qin, Zengchang
    INFORMATION FUSION, 2020, 55 : 116 - 126
  • [35] Bidirectional cascaded multimodal attention for multiple choice visual question answering
    Upadhyay, Sushmita
    Tripathy, Sanjaya Shankar
    MACHINE VISION AND APPLICATIONS, 2025, 36 (02)
  • [36] RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering
    Wang, Yuduo
    Ghamisi, Pedram
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [37] Multimodal Cross-guided Attention Networks for Visual Question Answering
    Liu, Haibin
    Gong, Shengrong
    Ji, Yi
    Yang, Jianyu
    Xing, Tengfei
    Liu, Chunping
    PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON COMPUTER MODELING, SIMULATION AND ALGORITHM (CMSA 2018), 2018, 151 : 347 - 353
  • [38] Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data
    Zhu, He
    Togo, Ren
    Ogawa, Takahiro
    Haseyama, Miki
    ELECTRONICS, 2023, 12 (10)
  • [39] Multimodal Graph Transformer for Multimodal Question Answering
    He, Xuehai
    Wang, Xin Eric
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 189 - 200