Named Entity Recognition in Semi Structured Documents Using Neural Tensor Networks

Cited by: 1
Authors
Shehzad, Khurram [1]
Ul-Hasan, Adnan [2]
Malik, Muhammad Imran [1]
Shafait, Faisal [1,2]
Affiliations
[1] Natl Univ Sci & Technol NUST, Sch Elect Engn & Comp Sci, Islamabad, Pakistan
[2] Natl Ctr Artificial Intelligence, Deep Learning Lab, Islamabad, Pakistan
Source
DOCUMENT ANALYSIS SYSTEMS | 2020 / Vol. 12116
Keywords
Named Entity Recognition; Neural Tensor Networks; Semi structured documents
DOI
10.1007/978-3-030-57058-3_28
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Information Extraction and Named Entity Recognition algorithms have major applications in many practical document analysis systems. Semi structured documents pose several challenges when it comes to extracting relevant information from them. The state-of-the-art methods rely heavily on feature engineering to perform layout-specific extraction of information and therefore do not generalize well. Extracting information without taking the document layout into consideration is required as a first step towards a general solution to this problem. To address this challenge, we propose a deep learning based pipeline to extract information from documents. For this purpose, we define 'information' to be a set of entities that have a label and a corresponding value, e.g., application number: ADNF8932NF and submission date: 15FEB19. We form relational triplets by connecting one entity to another via a relationship, such as (max temperature, is, 100 degrees), and train a neural tensor network, which is well suited to this kind of data, to predict high confidence scores for true triplets. Up to 96% test accuracy on real world documents from the publicly available GHEGA dataset demonstrates the effectiveness of our approach.
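The abstract does not give the network's exact form, so the following is only a minimal sketch of how a neural tensor network can score a triplet, assuming the commonly used formulation (a bilinear tensor term plus a linear term, passed through tanh and projected to a scalar). The function names, parameter shapes, and the margin-based loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def ntn_score(e1, e2, W, V, b, u):
    """Score one (e1, relation, e2) triplet with relation-specific parameters.

    Assumed shapes (hypothetical, for illustration only):
      e1, e2: (d,)      entity embeddings
      W:      (k, d, d) bilinear tensor with k slices
      V:      (k, 2d)   linear layer over the concatenated entities
      b, u:   (k,)      bias and output projection
    """
    # Bilinear term: one value e1^T W[s] e2 per tensor slice s.
    bilinear = np.einsum("i,kij,j->k", e1, W, e2)
    # Standard linear term over the concatenated entity vectors.
    linear = V @ np.concatenate([e1, e2])
    # Non-linearity, then projection to a single confidence score.
    return float(u @ np.tanh(bilinear + linear + b))


def contrastive_margin_loss(score_true, score_corrupt, margin=1.0):
    """Hinge loss that pushes true triplets above corrupted ones by a margin."""
    return max(0.0, margin - score_true + score_corrupt)
```

Under these assumptions, training would pair each true triplet, such as (max temperature, is, 100 degrees), with a corrupted one in which an entity is replaced, and minimize the margin loss so that true triplets receive the higher score.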
Pages: 398-409
Number of pages: 12