A Digitization Pipeline for Mixed-Typed Documents Using Machine Learning and Optical Character Recognition

Cited by: 1

Authors:

Matschak, Tizian [1]

Rampold, Florian [1]

Hellmeier, Malte [2]

Prinz, Christoph [1]

Trang, Simon [1]

Affiliations:

[1] Univ Goettingen, Wilhelmspl 1, D-37073 Gottingen, Germany

[2] Fraunhofer ISST, Emil Figge Str 91, D-44227 Dortmund, Germany

Source:

TRANSDISCIPLINARY REACH OF DESIGN SCIENCE RESEARCH, DESRIST 2022 | 2022 / Vol. 13229

Keywords:

Document image analysis; Optical character recognition; Digitization; Machine learning; Preprocessing; Postprocessing; DESIGN SCIENCE RESEARCH; OCR;

DOI:

10.1007/978-3-031-06516-3_15

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Although digitization is advancing rapidly, a large amount of data processed by companies is in printed format. Technologies such as Optical Character Recognition (OCR) support the transformation of printed text into machine-readable content. However, OCR struggles when data on documents is highly unstructured and includes non-text objects. This applies, for example, to documents such as medical prescriptions. Leveraging Design Science Research (DSR), we propose a flexible processing pipeline that handles both character recognition and object detection. To do so, we derive Design Requirements (DR) in cooperation with a practitioner doing prescription billing in the healthcare domain. We then develop a prototype blueprint that is applicable to similar problem formulations. Overall, we contribute to research and practice in multiple ways. First, we provide evidence for selected OCR methods proposed in previous research. Second, we design a machine-learning-based digitization pipeline for printed documents containing both text and non-text objects in the context of medical prescriptions. Third, we derive a nascent design pattern for this type of document digitization. Such patterns are the foundation for further research and can support the development of innovative information systems leading to more efficient decision making and thus to economic resource usage.

Pages: 195-207

Number of pages: 13

References: 39 in total

