Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

Cited by: 0
Authors
Li, Bingjia [1, 2]
Wang, Jie [3]
Zhao, Minyi [1, 2]
Zhou, Shuigeng [1, 2]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai 200438, Peoples R China
[2] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200438, Peoples R China
[3] ByteDance, Shanghai, Peoples R China
Source
COMPUTER VISION - ACCV 2022, PT IV | 2023 / Vol. 13844
Keywords
TextVQA; Scene text recognition; Multimodal information fusion; Contrastive learning;
DOI
10.1007/978-3-031-26316-3_39
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text-based visual question answering (TextVQA) is the task of answering a text-related question by reading the text in a given image, which requires joint reasoning over three modalities: the question, visual objects, and scene text. Most existing works use graphs or sophisticated attention mechanisms to enhance the interaction between scene text and visual objects. In this paper, we observe that the question and scene-text modalities matter more than visual objects in TextVQA, while the layout and visual appearance of scene text are also useful. Based on this observation, we propose a two-stage multimodality fusion method for high-performance TextVQA: it first semantically combines the question and OCR tokens to understand the text better, and then integrates the combined results into visual features as additional information. Furthermore, to alleviate redundancy and noise in the recognized scene text, we develop a denoising module with a contrastive loss that makes the model focus on relevant text and thus obtain more robust features. Experiments on the TextVQA and ST-VQA datasets show that our method achieves competitive performance without the large-scale pre-training used in recent works, and outperforms state-of-the-art methods after being pre-trained.
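The abstract gives only a high-level description of the method, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes each modality is already encoded as plain feature vectors, uses dot-product attention as a stand-in for both fusion stages, and uses an InfoNCE-style loss as one common form of contrastive loss. All function names (`two_stage_fusion`, `info_nce`, and the helpers) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    # Weighted average of `keys`, weighted by softmax(query . key).
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]

def two_stage_fusion(question, ocr_tokens, visual_feats):
    # Stage 1 (assumed form): mix the question into each OCR token,
    # scaled by how relevant that token is to the question.
    relevance = softmax([dot(question, t) for t in ocr_tokens])
    fused_text = [add(t, [r * q for q in question])
                  for t, r in zip(ocr_tokens, relevance)]
    # Stage 2 (assumed form): add an attention summary of the fused
    # text to each visual feature as additional information.
    return [add(v, attend(v, fused_text)) for v in visual_feats]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style loss: low when the anchor is closer to the
    # positive than to any negative (an assumed choice of loss).
    logits = [cosine(anchor, c) / temperature
              for c in [positive] + negatives]
    return -math.log(softmax(logits)[0])
```

In this sketch the loss would pull a feature toward question-relevant scene text (the positive) and away from irrelevant or noisy text (the negatives); how the paper actually selects positives and negatives is not stated in the abstract.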
Pages: 658 - 674
Number of pages: 17
Related papers
50 in total
  • [1] Separate and Locate: Rethink the Text in Text-based Visual Question Answering
    Fang, Chengyang
    Li, Jiangnan
    Li, Liang
    Ma, Can
    Hu, Dayong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4378 - 4388
  • [2] Learning Hierarchical Reasoning for Text-Based Visual Question Answering
    Li, Caiyuan
    Du, Qinyi
    Wang, Qingqing
    Jin, Yaohui
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT III, 2021, 12893 : 305 - 316
  • [3] Cascade Reasoning Network for Text-based Visual Question Answering
    Liu, Fen
    Xu, Guanghui
    Wu, Qi
    Du, Qing
    Jia, Wei
    Tan, Mingkui
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4060 - 4069
  • [4] RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering
    Jin, Zan-Xia
    Wu, Heran
    Yang, Chun
    Zhou, Fang
    Qin, Jingyan
    Xiao, Lei
    Yin, Xu-Cheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 1 - 12
  • [5] Text-instance graph: Exploring the relational semantics for text-based visual question answering
    Li, Xiangpeng
    Wu, Bo
    Song, Jingkuan
    Gao, Lianli
    Zeng, Pengpeng
    Gan, Chuang
    PATTERN RECOGNITION, 2022, 124
  • [6] Spatio-Temporal Two-stage Fusion for video question answering
    Xu, Feifei
    Zhu, Yitao
    Wang, Chun
    Cao, Yangze
    Zhong, Zheng
    Li, Xiongmin
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 237
  • [7] CNN for Text-Based Multiple Choice Question Answering
    Chaturvedi, Akshay
    Pandit, Onkar
    Garain, Utpal
    PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2, 2018, : 272 - 277
  • [8] Fusion of Detected Objects in Text for Visual Question Answering
    Alberti, Chris
    Ling, Jeffrey
    Collins, Michael
    Reitter, David
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 2131 - 2140
  • [9] Transformer models used for text-based question answering systems
    Nassiri, Khalid
    Akhloufi, Moulay
    APPLIED INTELLIGENCE, 2023, 53 (09) : 10602 - 10635