A probabilistic approach to printed document understanding

Cited by: 24
Authors
Medvet, Eric [1]
Bartoli, Alberto [1]
Davanzo, Giorgio [1]
Affiliation
[1] Univ Trieste, DEEI, I-34127 Trieste, Italy
Keywords
Document understanding; Automatic model upgrading; Invoice analysis; Maximum likelihood;
DOI
10.1007/s10032-010-0137-1
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., groups of documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document that contain the information to be extracted. Our approach is based on probability: we derive a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents from different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.
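The two steps named in the abstract (maximum likelihood estimation of per-class parameters, then selection of the blocks that maximize the probability) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' model: the abstract does not give the form of the probability or the block properties used, so the sketch takes a block to be a hypothetical (x, y, text) tuple, models each field's position with independent Gaussians, and treats fields as independent so that the best sequence reduces to a per-field argmax. The names fit_field, log_prob, and extract are hypothetical.

```python
import math
from statistics import mean, pvariance

# ASSUMPTION: a block is an (x, y, text) tuple and page position is the only
# block property modeled. The abstract does not specify the real properties.

def fit_field(samples, min_var=1.0):
    """Maximum likelihood Gaussian parameters from the operator-clicked blocks."""
    xs = [b[0] for b in samples]
    ys = [b[1] for b in samples]
    # pvariance is the ML (population) variance; the floor avoids a zero
    # variance when a class has only a couple of samples.
    return {
        "mx": mean(xs), "vx": max(pvariance(xs), min_var),
        "my": mean(ys), "vy": max(pvariance(ys), min_var),
    }

def log_gauss(value, mu, var):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mu) ** 2 / var)

def log_prob(block, params):
    """Log-probability that a block holds the field (x and y independent)."""
    return (log_gauss(block[0], params["mx"], params["vx"])
            + log_gauss(block[1], params["my"], params["vy"]))

def extract(blocks, model):
    """Pick, for every field, the block with the highest log-probability."""
    return {field: max(blocks, key=lambda b: log_prob(b, params))
            for field, params in model.items()}

# Hypothetical class described by two samples, as in the two-sample case
# mentioned in the abstract.
training = {
    "date": [(100, 50, "2010-01-02"), (104, 54, "2010-03-04")],
    "total": [(500, 700, "120.00"), (510, 690, "87.50")],
}
model = {field: fit_field(samples) for field, samples in training.items()}

new_document = [(102, 51, "2010-05-06"), (300, 400, "Widget"), (505, 696, "42.00")]
result = extract(new_document, model)
print(result["date"][2], result["total"][2])
```

Because the fields are assumed independent, maximizing the joint probability over the sequence of blocks is the same as maximizing each field separately; a model with dependencies between fields would need a joint search instead.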
Pages: 335-347
Number of pages: 13