QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document

Cited by: 2
Authors
Mahamoud, Ibrahim Souleiman [1 ,2 ]
Coustaty, Mickael [1 ]
Joseph, Aurelie [2 ]
d'Andecy, Vincent Poulain [2 ]
Ogier, Jean-Marc [1 ]
Affiliations
[1] La Rochelle Univ, L3i Ave Michel Crepeau, F-17042 La Rochelle, France
[2] Yooz, 1 Rue Fleming, F-17000 La Rochelle, France
Source
DOCUMENT ANALYSIS SYSTEMS, DAS 2022 | 2022, Vol. 13237
Keywords
Visual question answering; Multimodality; Attention mechanism;
DOI
10.1007/978-3-031-06555-2_44
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
The extraction of information from corporate documents is of growing interest in the research field, both for its economic impact and as a scientific challenge. To extract this information, the use of textual and visual content becomes unavoidable in order to understand the information inherent in the image. The information to be extracted is most often fixed beforehand (i.e. classification of words as date, total amount, etc.). However, the information to be extracted evolves, so we do not want to be restricted to predefined word classes. We would instead like to question a document, for example "which is the invoicing address?", since an invoice may contain several addresses. We formulate the request as a question and our model tries to answer it. Our model reaches 77.65% on the DocVQA dataset while drastically reducing the number of model parameters, which allows us to use it in an industrial context, and it uses an attention model over several modalities that helps in the interpretation of the results obtained. Our other contribution in this paper is a new dataset for visual question answering on corporate documents, built from invoices of RVL-CDIP [8]. Public data on corporate documents are scarce in the state of the art, and this contribution allows us to test our models on invoice data with VQA methods.
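The abstract describes querying a document with a natural-language question and fusing textual and visual cues through attention. Below is a minimal, hypothetical PyTorch sketch of such question-to-document cross-modal attention; the module names, dimensions, and fusion scheme are illustrative assumptions and do not reproduce the authors' QAlayout architecture.

# Hypothetical sketch (not the authors' QAlayout code): question tokens attend
# jointly over OCR-token text features and visual region features, and the
# fused representation yields a relevance score per question token.
import torch
import torch.nn as nn

class MultimodalQAAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # One cross-attention module per modality (text tokens, visual regions).
        self.attn_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.score = nn.Linear(dim, 1)  # relevance score per question token

    def forward(self, question, ocr_text, visual_regions):
        # question: (B, Lq, dim); ocr_text: (B, Lt, dim); visual_regions: (B, Lv, dim)
        q_over_text, _ = self.attn_text(question, ocr_text, ocr_text)
        q_over_visual, _ = self.attn_visual(question, visual_regions, visual_regions)
        fused = torch.tanh(self.fuse(torch.cat([q_over_text, q_over_visual], dim=-1)))
        return self.score(fused).squeeze(-1)  # (B, Lq)

# Toy usage with random embeddings standing in for encoded inputs.
model = MultimodalQAAttention()
q = torch.randn(1, 12, 256)   # encoded question, e.g. "which is the invoicing address?"
t = torch.randn(1, 80, 256)   # encoded OCR tokens of the invoice
v = torch.randn(1, 36, 256)   # encoded visual regions / layout patches
print(model(q, t, v).shape)   # torch.Size([1, 12])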
Pages: 659-673
Number of pages: 15
Related Papers
50 records in total
  • [21] Visual Question Answering
    Nada, Ahmed
    Chen, Min
    2024 INTERNATIONAL CONFERENCE ON COMPUTING, NETWORKING AND COMMUNICATIONS, ICNC, 2024, : 6 - 10
  • [22] Question Modifiers in Visual Question Answering
    Britton, William
    Sarkhel, Somdeb
    Venugopal, Deepak
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1472 - 1479
  • [23] Erasing-based Attention Learning for Visual Question Answering
    Liu, Fei
    Liu, Jing
    Hong, Richang
    Lu, Hanqing
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1175 - 1183
  • [24] ICDAR 2021 Competition on Document Visual Question Answering
    Tito, Ruben
    Mathew, Minesh
    Jawahar, C. V.
    Valveny, Ernest
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS AND RECOGNITION, ICDAR 2021, PT IV, 2021, 12824 : 635 - 649
  • [25] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [26] Local relation network with multilevel attention for visual question answering
    Sun, Bo
    Yao, Zeng
    Zhang, Yinghui
    Yu, Lejun
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 73
  • [27] Deep Modular Bilinear Attention Network for Visual Question Answering
    Yan, Feng
    Silamu, Wushouer
    Li, Yanbing
    SENSORS, 2022, 22 (03)
  • [28] Multi-view Attention Networks for Visual Question Answering
    Li, Min
    Bai, Zongwen
    Deng, Jie
    2024 6TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING, ICNLP 2024, 2024, : 788 - 794
  • [29] ADAPTIVE ATTENTION FUSION NETWORK FOR VISUAL QUESTION ANSWERING
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 997 - 1002
  • [30] Triple attention network for sentimental visual question answering
    Ruwa, Nelson
    Mao, Qirong
    Song, Heping
    Jia, Hongjie
    Dong, Ming
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189