DeExcelerator: A Framework for Extracting Relational Data From Partially Structured Documents

Cited by: 21
Authors
Eberius, Julian [1]
Werner, Christopher [1]
Thiele, Maik [1]
Braunschweig, Katrin [1]
Dannecker, Lars [2]
Lehner, Wolfgang [1]
Affiliations
[1] Tech Univ Dresden, Database Technol Grp, Dresden, Germany
[2] SAP AG, Dresden, Germany
Source
PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13) | 2013
Keywords
Spreadsheets; Normalization; Extracting Relational Tables
DOI
10.1145/2505515.250821
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead, the data is intermingled with formatting, layout, and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.
Pages: 3