Information extraction from scanned invoice images using text analysis and layout features

Cited by: 7
Authors
Ha, H. T. [1]
Horak, A. [1]
Affiliation
[1] Masaryk Univ, Fac Informat, Nat Language Proc Ctr, Bot 68a, Brno 60200, Czech Republic
Keywords
OCR; Information extraction; Scanned documents; Document metadata; Invoice metadata extraction; Metadata indexing
DOI
10.1016/j.image.2021.116601
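Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract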
While storing invoice content as metadata to avoid paper document processing may be the future trend, almost all invoices issued daily are still printed on paper or generated in digital formats such as PDF. In this paper, we introduce the OCRMiner system for information extraction from scanned document images. The system combines text analysis techniques with layout features to extract indexing metadata from (semi-)structured documents. It is designed to process a document in a way similar to a human reader, i.e. to employ different layout and text attributes in a coordinated decision. The system consists of a set of interconnected modules that start with the (possibly erroneous) character-based output of a standard OCR system and allow different techniques to be applied, expanding the extracted knowledge at each step. Using an open-source OCR engine, the system recovers 90% of the invoice data for the English set and 88% for the Czech set.
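The idea of combining text attributes with layout attributes can be illustrated with a minimal sketch. It is not the OCRMiner implementation: the `Word` record, the `find_value_right_of` helper, the pixel tolerance, and the sample coordinates are all hypothetical, chosen only to show how a keyword match (text) and a same-line, nearest-to-the-right test (layout) might be joined in one decision.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with the position of its top-left corner (hypothetical format)."""
    text: str
    x: float
    y: float


def find_value_right_of(words, label_pattern, value_pattern, line_tolerance=5.0):
    """Find a label token by regex, then return the nearest token to its right
    on roughly the same line whose text matches the value regex."""
    label_re = re.compile(label_pattern, re.IGNORECASE)
    value_re = re.compile(value_pattern)
    for label in words:
        if not label_re.fullmatch(label.text):
            continue
        # Layout filter: same line (within tolerance) and to the right of the label.
        # Text filter: the candidate must look like the expected value type.
        candidates = [
            w for w in words
            if w is not label
            and abs(w.y - label.y) <= line_tolerance
            and w.x > label.x
            and value_re.fullmatch(w.text)
        ]
        if candidates:
            return min(candidates, key=lambda w: w.x - label.x).text
    return None


# Made-up OCR output for a tiny invoice header.
words = [
    Word("Invoice", 40, 100), Word("No.", 95, 100), Word("2021-0042", 140, 101),
    Word("Date", 40, 130), Word("12.03.2021", 140, 129),
    Word("Total", 40, 400), Word("1,250.00", 140, 400),
]

invoice_no = find_value_right_of(words, r"no\.?", r"[\w-]*\d[\w-]*")
issue_date = find_value_right_of(words, r"date", r"\d{1,2}\.\d{1,2}\.\d{4}")
total = find_value_right_of(words, r"total", r"[\d,]+\.\d{2}")
print(invoice_no, issue_date, total)
```

A real pipeline would also need to tolerate OCR character errors in the labels and handle values placed below rather than beside a label; this sketch leaves both out for brevity.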
Pages: 11