Unifying Text, Tables, and Images for Multimodal Question Answering

Cited: 0
Authors
Luo, Haohao [1]
Shen, Ying [1]
Deng, Yang [2]
Affiliations
[1] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
Funding
National Natural Science Foundation of China;
Keywords
DOI
N/A
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal question answering (MMQA), which aims to derive the answer from multiple knowledge modalities (e.g., text, tables, and images), has received increasing attention due to its broad applications. Current approaches to MMQA often rely on single-modal or bimodal QA models, which limits their ability to integrate information across all modalities and to leverage the power of pretrained language models. To address these limitations, we propose a novel framework called UniMMQA, which unifies three different input modalities into a text-to-text format by employing position-enhanced table linearization and diversified image captioning techniques. Additionally, we enhance cross-modal reasoning by incorporating a multimodal rationale generator, which produces textual descriptions of cross-modal relations for adaptation into the text-to-text generation process. Experimental results on three MMQA benchmark datasets demonstrate the superiority of UniMMQA in both supervised and unsupervised settings.
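To make the table-unification step concrete: the abstract's "position-enhanced table linearization" flattens a 2-D table into a 1-D string while preserving row/column positions. The exact tag format UniMMQA uses is not given here, so the `<row_i>`/`<col_j>` markers and the cell template below are assumptions for illustration only.

```python
def linearize_table(header, rows):
    """Flatten a table into a single position-tagged string.

    A hypothetical sketch of position-enhanced linearization: each cell
    becomes "<row_i> <col_j> header : value" so a text-to-text model can
    recover 2-D structure from the 1-D sequence. The tag names are
    illustrative, not the paper's actual format.
    """
    parts = []
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            parts.append(f"<row_{i}> <col_{j}> {header[j]} : {cell}")
    return " | ".join(parts)


header = ["Country", "Capital"]
rows = [["France", "Paris"], ["Japan", "Tokyo"]]
print(linearize_table(header, rows))
# → <row_0> <col_0> Country : France | <row_0> <col_1> Capital : Paris | <row_1> <col_0> Country : Japan | <row_1> <col_1> Capital : Tokyo
```

The linearized string can then be concatenated with the question and generated image captions into one textual input for a pretrained text-to-text model.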
Pages: 9355-9367
Page count: 13
Related Papers
50 records
  • [31] Multimodal representative answer extraction in community question answering
    Li, Ming
    Ma, Yating
    Li, Ying
    Bai, Yixue
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2023, 35 (09)
  • [32] Multimodal Graph Reasoning and Fusion for Video Question Answering
    Zhang, Shuai
    Wang, Xingfu
    Hawbani, Ammar
    Zhao, Liang
    Alsamhi, Saeed Hamood
    2022 IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, 2022, : 1410 - 1415
  • [33] Adversarial Multimodal Network for Movie Story Question Answering
    Yuan, Zhaoquan
    Sun, Siyuan
    Duan, Lixin
    Li, Changsheng
    Wu, Xiao
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 1744 - 1756
  • [34] MUTAN: Multimodal Tucker Fusion for Visual Question Answering
    Ben-younes, Hedi
    Cadene, Remi
    Cord, Matthieu
    Thome, Nicolas
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 2631 - 2639
  • [35] Multimodal Knowledge Reasoning for Enhanced Visual Question Answering
    Hussain, Afzaal
    Maqsood, Ifrah
    Shahzad, Muhammad
    Fraz, Muhammad Moazam
    2022 16TH INTERNATIONAL CONFERENCE ON SIGNAL-IMAGE TECHNOLOGY & INTERNET-BASED SYSTEMS, SITIS, 2022, : 224 - 230
  • [36] MUREL: Multimodal Relational Reasoning for Visual Question Answering
    Cadene, Remi
    Ben-younes, Hedi
    Cord, Matthieu
    Thome, Nicolas
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998
  • [37] Health-Oriented Multimodal Food Question Answering
    Wang, Jianghai
    Hu, Menghao
    Song, Yaguang
    Yang, Xiaoshan
    MULTIMEDIA MODELING, MMM 2023, PT I, 2023, 13833 : 191 - 203
  • [38] Multimodal Prompt Retrieval for Generative Visual Question Answering
    Ossowski, Timothy
    Hu, Junjie
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 2518 - 2535
  • [39] Dealing with spoken requests in a multimodal Question Answering system
    Gretter, Roberto
    Kouylekov, Milen
    Negri, Matteo
    ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, 2008, 5253 : 93 - 102
  • [40] A fine-tuned multimodal large model for power defect image-text question-answering
    Wang, Qiqi
    Zhang, Jie
    Du, Jianming
    Zhang, Ke
    Li, Rui
    Zhao, Feng
    Zou, Le
    Xie, Chengjun
    SIGNAL IMAGE AND VIDEO PROCESSING, 2024, 18 (12) : 9191 - 9203