So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering

Cited by: 1
Authors
Zheng, Wenbo [1 ,2 ]
Yan, Lan [3 ,4 ]
Wang, Fei-Yue [5 ]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan 430070, Peoples R China
[2] Wuhan Univ Technol, Sanya Sci & Educ Innovat Pk, Sanya 572000, Peoples R China
[3] Hunan Univ, Coll Comp Sci & Engn, Changsha 410082, Hunan, Peoples R China
[4] Natl Supercomp Ctr, Changsha 410082, Hunan, Peoples R China
[5] Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China
Funding
Natural Science Foundation of Hainan Province;
Keywords
Graph attention; graph reasoning; multimodal graph; self-attention; text-based visual question answering; ATTENTION; LANGUAGE; VISION;
DOI
10.1109/TSMC.2023.3319964
Chinese Library Classification (CLC) Code
TP [Automation Technology; Computer Technology];
Discipline Classification Code
0812;
Abstract
While texts related to images convey fundamental messages for scene understanding and reasoning, text-based visual question answering tasks concentrate on visual questions that require reading texts from images. However, most current methods feed multimodal features, extracted independently from a given image, into a reasoning model without considering their inter- and intra-relationships across three modalities (i.e., scene texts, questions, and images). To this end, we propose a novel text-based visual question answering model, multimodal graph reasoning. Our model first extracts intramodality relationships by representing each modality as a semantic graph. Then, we present graph multihead self-attention, which boosts each graph representation through graph-by-graph aggregation to capture the intermodality relationships. It is a case of "so many heads, so many wits" in the sense that as more semantic graphs are involved in this process, each graph representation becomes more effective. Finally, these representations are reprojected, and we perform answer prediction from their outputs. The experimental results demonstrate that our approach achieves substantially better performance than other state-of-the-art models.
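The graph multihead self-attention described in the abstract can be illustrated with a minimal sketch: per-modality graph embeddings (scene text, question, image) are stacked and attended over jointly, so each modality's representation is refined by aggregating over all the others. This is not the authors' implementation; the random weight matrices stand in for learned parameters, and all shapes and names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_multihead_self_attention(X, num_heads, rng):
    """Sketch of multi-head self-attention over stacked graph embeddings.

    X: (n_graphs, d) array, one pooled embedding per modality graph
       (e.g., scene text, question, image).
    Returns an (n_graphs, d) array of cross-graph aggregated embeddings.
    """
    n, d = X.shape
    assert d % num_heads == 0, "embedding dim must divide evenly across heads"
    dh = d // num_heads
    # Random projections stand in for learned Q/K/V/output weights.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        # (n_graphs, n_graphs) attention: how much each graph draws on the others.
        attn = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))
        heads.append(attn @ V[:, s])
    # Concatenate heads and reproject, as in standard multi-head attention.
    return np.concatenate(heads, axis=1) @ Wo
```

With more graphs (rows of `X`) taking part, each output row is aggregated from a richer set of neighbors, which is the "so many heads, so many wits" intuition stated in the abstract.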
Pages: 854-865 (12 pages)
Related Papers
13 records
  • [1] Cascade Reasoning Network for Text-based Visual Question Answering
    Liu, Fen
    Xu, Guanghui
    Wu, Qi
    Du, Qing
    Jia, Wei
    Tan, Mingkui
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4060 - 4069
  • [2] Learning Hierarchical Reasoning for Text-Based Visual Question Answering
    Li, Caiyuan
    Du, Qinyi
    Wang, Qingqing
    Jin, Yaohui
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT III, 2021, 12893 : 305 - 316
  • [3] Text-instance graph: Exploring the relational semantics for text-based visual question answering
    Li, Xiangpeng
    Wu, Bo
    Song, Jingkuan
    Gao, Lianli
    Zeng, Pengpeng
    Gan, Chuang
    PATTERN RECOGNITION, 2022, 124
  • [4] Separate and Locate: Rethink the Text in Text-based Visual Question Answering
    Fang, Chengyang
    Li, Jiangnan
    Li, Liang
    Ma, Can
    Hu, Dayong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4378 - 4388
  • [5] Answering "Why Empty?" and "Why So Many?" queries in graph databases
    Vasilyeva, Elena
    Thiele, Maik
    Bornhoevd, Christof
    Lehner, Wolfgang
    JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2016, 82 (01) : 3 - 22
  • [6] Weakly-Supervised 3D Spatial Reasoning for Text-Based Visual Question Answering
    Li, Hao
    Huang, Jinfa
    Jin, Peng
    Song, Guoli
    Wu, Qi
    Chen, Jie
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3367 - 3382
  • [7] RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering
    Jin, Zan-Xia
    Wu, Heran
    Yang, Chun
    Zhou, Fang
    Qin, Jingyan
    Xiao, Lei
    Yin, Xu-Cheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 1 - 12
  • [8] VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
    Wang, Yanan
    Yasunaga, Michihiro
    Ren, Hongyu
    Wada, Shinya
    Leskovec, Jure
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21525 - 21535
  • [9] Visual Question Answering reasoning with external knowledge based on bimodal graph neural network
    Yang, Zhenyu
    Wu, Lei
    Wen, Peian
    Chen, Peng
    ELECTRONIC RESEARCH ARCHIVE, 2023, 31 (04): : 1948 - 1965
  • [10] Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering
    Li, Bingjia
    Wang, Jie
    Zhao, Minyi
    Zhou, Shuigeng
    COMPUTER VISION - ACCV 2022, PT IV, 2023, 13844 : 658 - 674