Unifying Text, Tables, and Images for Multimodal Question Answering

Cited: 0
Authors
Luo, Haohao [1 ]
Shen, Ying [1 ]
Deng, Yang [2 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Multimodal question answering (MMQA), which aims to derive the answer from multiple knowledge modalities (e.g., text, tables, and images), has received increasing attention due to its broad applications. Current approaches to MMQA often rely on single-modal or bimodal QA models, which limits their ability to integrate information across all modalities and to leverage the power of pretrained language models. To address these limitations, we propose a novel framework called UniMMQA, which unifies the three input modalities into a text-to-text format by employing position-enhanced table linearization and diversified image captioning techniques. Additionally, we enhance cross-modal reasoning by incorporating a multimodal rationale generator, which produces textual descriptions of cross-modal relations that are incorporated into the text-to-text generation process. Experimental results on three MMQA benchmark datasets show the superiority of UniMMQA in both supervised and unsupervised settings.
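A minimal Python sketch of what the position-enhanced table linearization step described above might look like. The exact serialization format used by UniMMQA is not given in this record, so the position markers, function name, and example data below are illustrative assumptions rather than the authors' implementation.

def linearize_table(header, rows):
    # Flatten a table into text, tagging every cell with explicit row/column
    # positions so a text-to-text model can still recover the table structure.
    # The "[row i, col j]" markers are assumed placeholders, not the paper's format.
    parts = [f"[col {j}] {name}" for j, name in enumerate(header)]
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            parts.append(f"[row {i}, col {j}] {cell}")
    return " ".join(parts)

# Example: the linearized string would then be concatenated with the question,
# generated image captions, and passage text to form one textual input
# for a pretrained text-to-text language model.
header = ["Player", "Team", "Goals"]
rows = [["Messi", "Inter Miami", "11"], ["Kane", "Bayern Munich", "36"]]
print(linearize_table(header, rows))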
Pages: 9355-9367 (13 pages)