Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets

Cited by: 11
Authors
Shigarov, Alexey O. [1]
Paramonov, Viacheslav V. [1]
Belykh, Polina V. [1]
Bondarev, Alexander I. [1]
Affiliations
[1] RAS, Matrosov Inst Syst Dynam & Control Theory, SB, Irkutsk, Russia
Source
INFORMATION AND SOFTWARE TECHNOLOGIES, ICIST 2016 | 2016 / Vol. 639
Funding
Russian Foundation for Basic Research
Keywords
Unstructured data integration; Table understanding; Table analysis and interpretation; Spreadsheet data transformation; Ontology generation; Web
DOI
10.1007/978-3-319-46254-7_7
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Arbitrary tables in spreadsheets can be an important data source for business intelligence. However, many of them have complex layouts that hinder extracting, transforming, and loading their data into a database. This paper addresses rule-based transformation of data from arbitrary spreadsheet tables into a structured canonical form that can be loaded into a database by standard ETL tools. We propose a system for canonicalizing such tables as an implementation of our methodology for rule-based table analysis and interpretation. It executes rules expressed in our specialized rule language, CRL, to recover implicit relationships in a table. Our experimental results show that particular CRL programs can be developed for different sets of tables with similar features to automate table canonicalization with high accuracy.
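To make the idea of a canonical form concrete, the following is a minimal Python sketch of flattening a small cross-tabulated table into one record per data cell. It is not CRL and not the authors' system: the function name `canonicalize`, the parameters `header_rows` and `stub_cols`, the record layout, and the sample grid are all assumptions made for illustration, and the single hard-coded rule (blank header cells inherit the label to their left) merely stands in for the kind of implicit relationship a rule set might recover.

```python
from typing import Any


def canonicalize(grid: list[list[Any]], header_rows: int, stub_cols: int) -> list[dict[str, Any]]:
    """Flatten a cross-tab grid into one record per non-empty data cell.

    Hypothetical helper for illustration only; it assumes the top
    `header_rows` rows are column headers and the left `stub_cols`
    columns are row labels.
    """
    # Assumed rule: a blank header cell belongs to the nearest label on its
    # left, which approximates a merged header cell spanning several columns.
    headers = []
    for row in grid[:header_rows]:
        filled, last = [], None
        for cell in row:
            if cell not in (None, ""):
                last = cell
            filled.append(last)
        headers.append(filled)

    records = []
    for row in grid[header_rows:]:
        row_labels = tuple(row[:stub_cols])
        for c in range(stub_cols, len(row)):
            value = row[c]
            if value in (None, ""):
                continue  # skip empty data cells
            records.append({
                "row_labels": row_labels,
                "col_labels": tuple(h[c] for h in headers),
                "value": value,
            })
    return records


# Made-up sample: a two-level column header (year, quarter) and one stub column.
grid = [
    ["", "2015", "", "2016", ""],
    ["", "Q1", "Q2", "Q1", "Q2"],
    ["North", 10, 12, 14, 16],
    ["South", 7, None, 9, 11],
]
records = canonicalize(grid, header_rows=2, stub_cols=1)
for record in records:
    print(record)
```

Each resulting record pairs a value with the labels that address it, which is the flat shape a conventional ETL tool can load as rows. Real spreadsheet tables vary far more than this fixed layout allows, which is the motivation for expressing such rules per table set rather than hard-coding them.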
Pages: 78-91
Page count: 14