A probabilistic approach to printed document understanding

Cited by: 24
Authors
Medvet, Eric [1]
Bartoli, Alberto [1]
Davanzo, Giorgio [1]
Affiliations
[1] Univ Trieste, DEEI, I-34127 Trieste, Italy
Keywords
Document understanding; Automatic model upgrading; Invoice analysis; Maximum likelihood
DOI
10.1007/s10032-010-0137-1
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., groups of documents sharing similar content and layout, is large and may evolve over time. Describing a new class is simple: the operator provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document that contain the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator's actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents from different domains (invoices, patents, data sheets), with very good results: for example, a success rate often greater than 90% even for classes with just two samples.
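The two steps named in the abstract, maximum-likelihood parameter estimation from a few samples and selection of the most probable blocks, can be illustrated with a small sketch. The abstract does not give the form of the probability, so everything specific here is an assumption made for illustration: each block is described only by hypothetical `x` and `y` properties, each property is modeled as an independent Gaussian, and fields are treated as independent, so the best sequence of blocks is simply the best block per field. The function names and the variance floor are also hypothetical and are not taken from the paper.

```python
import math

# Assumed variance floor: with only two samples, identical values would
# otherwise yield a zero variance and a degenerate density.
MIN_VAR = 1e-6


def fit_gaussian_ml(values):
    """Maximum-likelihood Gaussian fit: sample mean and biased (1/n) variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(var, MIN_VAR)


def fit_field_model(sample_blocks):
    """Fit one independent Gaussian per block property (assumed model)."""
    props = sample_blocks[0].keys()
    return {p: fit_gaussian_ml([b[p] for b in sample_blocks]) for p in props}


def log_likelihood(block, model):
    """Sum of per-property Gaussian log-densities for one candidate block."""
    total = 0.0
    for prop, (mean, var) in model.items():
        total += -0.5 * math.log(2 * math.pi * var)
        total -= (block[prop] - mean) ** 2 / (2 * var)
    return total


def extract(blocks, models):
    """For each field, return the index of the block with the highest log-likelihood."""
    return {
        field: max(range(len(blocks)), key=lambda i: log_likelihood(blocks[i], model))
        for field, model in models.items()
    }


# Two samples per field, as if the operator had clicked the relevant block
# in two sample documents of the same class (made-up coordinates).
samples = {
    "total": [{"x": 400, "y": 700}, {"x": 410, "y": 690}],
    "date": [{"x": 50, "y": 100}, {"x": 60, "y": 110}],
}
models = {field: fit_field_model(blocks) for field, blocks in samples.items()}

# OCR blocks of a new document of the same class (made-up coordinates).
new_blocks = [{"x": 55, "y": 104}, {"x": 405, "y": 696}, {"x": 200, "y": 400}]
result = extract(new_blocks, models)
print(result)
```

Under these assumptions the block nearest each field's learned position is selected. A model closer to the one described in the abstract would score whole sequences of blocks jointly rather than each field separately.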
Pages: 335-347
Page count: 13