DeExcelerator: A Framework for Extracting Relational Data From Partially Structured Documents

Cited by: 21

Authors:

Eberius, Julian [1]

Werner, Christopher [1]

Thiele, Maik [1]

Braunschweig, Katrin [1]

Dannecker, Lars [2]

Lehner, Wolfgang [1]

Affiliations:

[1] Tech Univ Dresden, Database Technol Grp, Dresden, Germany

[2] SAP AG, Dresden, Germany

Source:

PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13) | 2013

Keywords:

Spreadsheets; Normalization; Extracting Relational Tables;

DOI:

10.1145/2505515.250821

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead, the data is intermingled with formatting, layout, and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.

Pages: 3