QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document

Cited by: 2

Authors:

Mahamoud, Ibrahim Souleiman [1,2]

Coustaty, Mickael [1]

Joseph, Aurelie [2]

d'Andecy, Vincent Poulain [2]

Ogier, Jean-Marc [1]

Affiliations:

[1] La Rochelle Univ, L3i Ave Michel Crepeau, F-17042 La Rochelle, France

[2] Yooz, 1 Rue Fleming, F-17000 La Rochelle, France

Source:

DOCUMENT ANALYSIS SYSTEMS, DAS 2022 | 2022 / Vol. 13237

Keywords:

Visual question answering; Multimodality; Attention mechanism;

DOI:

10.1007/978-3-031-06555-2_44

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The extraction of information from corporate documents is attracting growing research interest, both for its economic value and as a scientific challenge. Extracting this information requires combining textual and visual content in order to understand the information carried by the document image. The information to be extracted is most often fixed beforehand (i.e., words are classified as date, total amount, etc.). However, the information to be extracted evolves, so we do not want to be restricted to predefined word classes. Instead, we would like to query a document with a question such as "which is the address of invoicing?", since an invoice can contain several addresses. We formulate our request as a question, and our model tries to answer it. Our model achieves a result of 77.65% on the DocVQA dataset while drastically reducing the number of model parameters, which allows us to use it in an industrial context. It relies on an attention model over several modalities, which helps us interpret the results obtained. Our other contribution in this paper is a new dataset for visual question answering on corporate documents, built from the invoices of RVL-CDIP [8]. Public data on corporate documents are scarce in the state of the art; this contribution allows us to test our models on invoice data with VQA methods.


Pages: 659-673

Page count: 15
