Cascade Reasoning Network for Text-based Visual Question Answering

Cited by: 38
Authors
Liu, Fen [1]
Xu, Guanghui [1]
Wu, Qi [2]
Du, Qing [1]
Jia, Wei [3]
Tan, Mingkui [1]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] Univ Adelaide, Adelaide, SA, Australia
[3] CVTE, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Text-based VQA; Multimodal Information; Progressive Attention; Reasoning Graph
DOI
10.1145/3394171.3413924
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA) which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both texts and visual concepts that appear in images. Challenges in T-VQA mainly lie in three aspects: 1) It is difficult to understand the complex logic in questions and extract specific useful information from rich image contents to answer them; 2) The text-related questions are also related to visual concepts, but it is difficult to capture cross-modal relationships between the texts and the visual concepts; 3) If the OCR (optical character recognition) system fails to detect the target text, the training will be very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards the multimodal information fusion operation as a stepwise encoding process and uses the previous attention results to guide the next fusion process. The MRG aims to explicitly model the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task to train the model with accurate supervision signals, thereby enhancing the reasoning ability of the model in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with SOTA methods. The source code is available at https://github.com/guanghuixu/CRN_tvga.
Pages: 4060-4069
Number of pages: 10