Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents

Cited by: 13
Authors
Sarkhel, Ritesh [1]
Nandi, Arnab [1]
Affiliations
[1] Ohio State Univ, Columbus, OH 43210 USA
Source
SIGMOD '19: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2019
Keywords
Visually rich document; Information extraction; Named entity; Recognition; HTML
DOI
10.1145/3299869.3319867
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Physical and digital documents often contain visually rich information. With such information, there is no strict ordering or positioning in the document where the data values must appear. Along with textual cues, these documents often also rely on salient visual features to define distinct semantic boundaries and augment the information they disseminate. When performing information extraction (IE), traditional techniques fall short, as they use a text-only representation and do not consider the visual cues inherent to the layout of these documents. We propose VS2, a generalized approach for information extraction from heterogeneous visually rich documents. There are two major contributions of this work. First, we propose a robust segmentation algorithm that decomposes a visually rich document into a bag of visually isolated but semantically coherent areas, called logical blocks. Document-type-agnostic low-level visual and semantic features are used in this process. Our second contribution is a distantly supervised search-and-select method for identifying the named entities within these documents by utilizing the context boundaries defined by these logical blocks. Experimental results on three heterogeneous datasets suggest that the proposed approach significantly outperforms its text-only counterparts on all datasets. A comparison against state-of-the-art methods also reveals that VS2 performs comparably or better on all datasets.
Pages: 247-262
Page count: 16