QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document

Cited by: 2
Authors
Mahamoud, Ibrahim Souleiman [1, 2]
Coustaty, Mickael [1]
Joseph, Aurelie [2]
d'Andecy, Vincent Poulain [2]
Ogier, Jean-Marc [1]
Affiliations
[1] La Rochelle Univ, L3i Ave Michel Crepeau, F-17042 La Rochelle, France
[2] Yooz, 1 Rue Fleming, F-17000 La Rochelle, France
Source
Keywords
Visual question answering; Multimodality; Attention mechanism;
DOI
10.1007/978-3-031-06555-2_44
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The extraction of information from corporate documents is attracting growing research interest, both for its economic value and as a scientific challenge. Extracting this information requires using both the textual and the visual content of the document image. The information to be extracted is most often fixed beforehand (e.g. classification of words by date, total amount, etc.). However, the information of interest evolves over time, so we do not want to be restricted to predefined word classes. Instead, we would like to question a document, for example "which is the address of invoicing?", since an invoice can contain several addresses. We formulate the request as a question and the model tries to answer it. Our model achieves 77.65% on the DocVQA dataset while drastically reducing the number of model parameters, which makes it usable in an industrial context. It relies on an attention model over several modalities, which helps in interpreting the results obtained. Our other contribution in this paper is a new dataset for visual question answering on corporate documents, built from invoices in RVL-CDIP [8]. Public data on corporate documents are scarce in the state of the art; this contribution allows us to test our models on invoice data with VQA methods.
Pages: 659-673
Page count: 15
Related papers
50 items in total
  • [31] Erasing-based Attention Learning for Visual Question Answering
    Liu, Fei
    Liu, Jing
    Hong, Richang
    Lu, Hanqing
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1175 - 1183
  • [32] Multimodal Bi-direction Guided Attention Networks for Visual Question Answering
    Cai, Linqin
    Xu, Nuoying
    Tian, Hang
    Chen, Kejia
    Fan, Haodu
    NEURAL PROCESSING LETTERS, 2023, 55 (09) : 11921 - 11943
  • [33] Counting Attention Based on Classification Confidence for Visual Question Answering
    Chen, Mingqin
    Wang, Yilei
    Chen, Shan
    Wu, Yingjie
    2019 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2019), 2019, : 1173 - 1179
  • [35] Question Modifiers in Visual Question Answering
    Britton, William
    Sarkhel, Somdeb
    Venugopal, Deepak
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1472 - 1479
  • [36] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [37] MUTAN: Multimodal Tucker Fusion for Visual Question Answering
    Ben-younes, Hedi
    Cadene, Remi
    Cord, Matthieu
    Thome, Nicolas
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 2631 - 2639
  • [38] Multimodal Knowledge Reasoning for Enhanced Visual Question Answering
    Hussain, Afzaal
    Maqsood, Ifrah
    Shahzad, Muhammad
    Fraz, Muhammad Moazam
    2022 16TH INTERNATIONAL CONFERENCE ON SIGNAL-IMAGE TECHNOLOGY & INTERNET-BASED SYSTEMS, SITIS, 2022, : 224 - 230
  • [39] ICDAR 2021 Competition on Document Visual Question Answering
    Tito, Ruben
    Mathew, Minesh
    Jawahar, C. V.
    Valveny, Ernest
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS AND RECOGNITION, ICDAR 2021, PT IV, 2021, 12824 : 635 - 649
  • [40] MUREL: Multimodal Relational Reasoning for Visual Question Answering
    Cadene, Remi
    Ben-younes, Hedi
    Cord, Matthieu
    Thome, Nicolas
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998