Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

Cited by: 0

Authors:

Li, Bingjia [1,2]

Wang, Jie [3]

Zhao, Minyi [1,2]

Zhou, Shuigeng [1,2]

Affiliations:

[1] Fudan Univ, Sch Comp Sci, Shanghai 200438, Peoples R China

[2] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200438, Peoples R China

[3] ByteDance, Shanghai, Peoples R China

Source:

COMPUTER VISION - ACCV 2022, PT IV | 2023 / Vol. 13844

Keywords:

TextVQA; Scene text recognition; Multimodal information fusion; Contrastive learning

DOI:

10.1007/978-3-031-26316-3_39

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Text-based visual question answering (TextVQA) is the task of answering a text-related question by reading the text in a given image, which requires joint reasoning over three modalities: the question, visual objects, and scene texts in the image. Most existing works leverage graphs or sophisticated attention mechanisms to enhance the interaction between scene texts and visual objects. In this paper, we observe that the question and scene text modalities are more important in TextVQA than visual objects, while both the layouts and the visual appearances of scene texts are also useful. We therefore propose a two-stage multimodality fusion method for high-performance TextVQA, which first semantically combines the question and OCR tokens to understand texts better and then integrates the combined results into visual features as additional information. Furthermore, to alleviate the redundancy and noise in the recognized scene texts, we develop a denoising module with a contrastive loss that makes our model focus on the relevant texts and thus obtain more robust features. Experiments on the TextVQA and ST-VQA datasets show that our method achieves competitive performance without the large-scale pre-training used in recent works, and outperforms state-of-the-art methods after being pre-trained.
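
The abstract gives no architectural detail, so the following is only an illustrative sketch of the two ideas it names: fusing question and OCR tokens first and then folding the result into visual features, plus a contrastive loss that favors relevant scene text. Everything here is an assumption for illustration: the function names (`attend`, `two_stage_fusion`, `info_nce`), the use of plain dot-product attention with a residual connection, and the InfoNCE-style loss are generic stand-ins, not the authors' actual modules.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys):
    """Scaled dot-product attention (keys double as values) plus a residual.

    Hypothetical stand-in for a fusion layer; not the paper's architecture.
    """
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        context = [sum(w * k[i] for w, k in zip(weights, keys))
                   for i in range(len(q))]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused

def two_stage_fusion(question, ocr, visual):
    # Stage 1 (assumed form): OCR tokens attend to the question tokens.
    text = attend(ocr, question)
    # Stage 2 (assumed form): visual features take the fused text
    # as additional information.
    visual_plus_text = attend(visual, text)
    return text, visual_plus_text

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / (norm or 1.0)  # guard against zero vectors

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closest to the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])

if __name__ == "__main__":
    question = [[1.0, 0.0], [0.5, 0.5]]   # toy question token features
    ocr = [[0.9, 0.1], [0.0, 1.0]]        # toy OCR token features
    visual = [[0.2, 0.8]]                 # toy visual object feature
    text, fused_visual = two_stage_fusion(question, ocr, visual)
    print(len(text), len(fused_visual))
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In this toy setup, the loss is small when the anchor (for example, a question embedding) is aligned with the positive OCR token and large when it is aligned with a negative one, which is the behavior a denoising objective of this kind relies on.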


Pages: 658-674

Page count: 17
