Unifying Text, Tables, and Images for Multimodal Question Answering

Cited by: 0

Authors:

Luo, Haohao [1]

Shen, Ying [1]

Deng, Yang [2]

Affiliations:

[1] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou, Peoples R China

[2] Natl Univ Singapore, Singapore, Singapore

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023) | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Multimodal question answering (MMQA), which aims to derive the answer from multiple knowledge modalities (e.g., text, tables, and images), has received increasing attention due to its broad applications. Current approaches to MMQA often rely on single-modal or bimodal QA models, which limits their ability to effectively integrate information across all modalities and leverage the power of pretrained language models. To address these limitations, we propose a novel framework called UniMMQA, which unifies three different input modalities into a text-to-text format by employing position-enhanced table linearization and diversified image captioning techniques. Additionally, we enhance cross-modal reasoning by incorporating a multimodal rationale generator, which produces textual descriptions of cross-modal relations for adaptation into the text-to-text generation process. Experimental results on three MMQA benchmark datasets show the superiority of UniMMQA in both supervised and unsupervised settings.
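To make the idea of a unified text-to-text input concrete, the following is a minimal sketch of how a table might be flattened with explicit position tags and then concatenated with passages and image captions. It is not the authors' implementation: the tag format (`[row r, col c]`), the separators, the field prefixes, and the function names `linearize_table` and `build_unified_input` are all assumptions chosen for illustration, and the captions are assumed to come from some external captioning model.

```python
from typing import List, Sequence


def linearize_table(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Flatten a table into one string, tagging each cell with its position.

    The "[row r, col c]" tag format is a hypothetical choice for this sketch.
    """
    parts: List[str] = ["header: " + " | ".join(header)]
    for r, row in enumerate(rows, start=1):
        cells = [
            f"[row {r}, col {c}] {name}: {value}"
            for c, (name, value) in enumerate(zip(header, row), start=1)
        ]
        parts.append(" ; ".join(cells))
    return " || ".join(parts)


def build_unified_input(
    question: str,
    passages: Sequence[str],
    table_text: str,
    captions: Sequence[str],
) -> str:
    """Join all modalities into a single text sequence for a text-to-text model.

    The field prefixes ("question:", "text:", ...) are illustrative assumptions.
    """
    return (
        f"question: {question} "
        f"text: {' '.join(passages)} "
        f"table: {table_text} "
        f"images: {' '.join(captions)}"
    )


if __name__ == "__main__":
    table = linearize_table(["Name", "Year"], [["A", "2001"], ["B", "2005"]])
    print(build_unified_input(
        "Which entry is from 2005?",
        ["A and B are two entries."],
        table,
        ["a photo of entry B"],
    ))
```

Keeping the row and column index next to each cell is one simple way to preserve layout information that plain concatenation would lose; the resulting string can be passed to any pretrained text-to-text model.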


Pages: 9355-9367

Page count: 13

50 items in total

[31] Multimodal representative answer extraction in community question answering
Li, Ming
Ma, Yating
Li, Ying
Bai, Yixue
JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2023, 35 (09)
[32] Multimodal Graph Reasoning and Fusion for Video Question Answering
Zhang, Shuai
Wang, Xingfu
Hawbani, Ammar
Zhao, Liang
Alsamhi, Saeed Hamood
2022 IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, 2022, : 1410 - 1415
[33] Adversarial Multimodal Network for Movie Story Question Answering
Yuan, Zhaoquan
Sun, Siyuan
Duan, Lixin
Li, Changsheng
Wu, Xiao
Xu, Changsheng
IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 1744 - 1756
[34] MUTAN: Multimodal Tucker Fusion for Visual Question Answering
Ben-younes, Hedi
Cadene, Remi
Cord, Matthieu
Thome, Nicolas
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 2631 - 2639
[35] Multimodal Knowledge Reasoning for Enhanced Visual Question Answering
Hussain, Afzaal
Maqsood, Ifrah
Shahzad, Muhammad
Fraz, Muhammad Moazam
2022 16TH INTERNATIONAL CONFERENCE ON SIGNAL-IMAGE TECHNOLOGY & INTERNET-BASED SYSTEMS, SITIS, 2022, : 224 - 230
[36] MUREL: Multimodal Relational Reasoning for Visual Question Answering
Cadene, Remi
Ben-younes, Hedi
Cord, Matthieu
Thome, Nicolas
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998
[37] Health-Oriented Multimodal Food Question Answering
Wang, Jianghai
Hu, Menghao
Song, Yaguang
Yang, Xiaoshan
MULTIMEDIA MODELING, MMM 2023, PT I, 2023, 13833 : 191 - 203
[38] Multimodal Prompt Retrieval for Generative Visual Question Answering
Ossowski, Timothy
Hu, Junjie
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 2518 - 2535
[39] Dealing with spoken requests in a multimodal Question Answering system
Gretter, Roberto
Kouylekov, Milen
Negri, Matteo
ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, 2008, 5253 : 93 - 102
[40] A fine-tuned multimodal large model for power defect image-text question-answering
Wang, Qiqi
Zhang, Jie
Du, Jianming
Zhang, Ke
Li, Rui
Zhao, Feng
Zou, Le
Xie, Chengjun
SIGNAL IMAGE AND VIDEO PROCESSING, 2024, 18 (12) : 9191 - 9203
