Data Profiling - A Tutorial

Cited by: 28
Authors
Abedjan, Ziawasch [1]
Golab, Lukasz [2]
Naumann, Felix [3]
Affiliations
[1] TU Berlin, Berlin, Germany
[2] Univ Waterloo, Waterloo, ON, Canada
[3] Hasso Plattner Inst, Potsdam, Germany
Source
SIGMOD'17: PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2017
Keywords
DATA QUALITY; DISCOVERY;
DOI
10.1145/3035918.3054772
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to systematic inference of structural information and statistics of a dataset using dedicated profiling tools. In this tutorial, we highlight the importance of data profiling as part of any data-related use-case, and we discuss the area of data profiling by classifying data profiling tasks and reviewing the state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and profiling algorithms for dynamic data and streams. We also pay special attention to visualizing and interpreting the results of data profiling. We conclude with directions for future research in the area of data profiling. This tutorial is based on our survey on profiling relational data [2].
Pages: 1747-1751
Number of pages: 5
Related papers
34 in total
[11]  
Bauckmann J., 2007, Proc. of PhD Workshop in Conjunction with VLDB 2007, Vienna, P1448
[12]  
Bauckmann J., 2012, PROC C INF KNOWL MAN, P2094
[13]  
Bravo L., 2007, VLDB, P243
[14]  
Chu X, 2014, PROC INT CONF DATA, P1222, DOI 10.1109/ICDE.2014.6816746
[15]   Discovering Denial Constraints [J].
Chu, Xu ;
Ilyas, Ihab F. ;
Papotti, Paolo .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2013, 6 (13) :1498-1509
[16]   Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches [J].
Cormode, Graham ;
Garofalakis, Minos ;
Haas, Peter J. ;
Jermaine, Chris .
FOUNDATIONS AND TRENDS IN DATABASES, 2011, 4 (1-3) :1-294
[17]  
DASU T, 2006, IEEE DATA ENG B, V29, P43
[18]  
Deng D., 2017, P C INNN SYST RES CI
[19]  
GOLAB L, 2009, PVLDB, V2, P574
[20]   Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux [J].
Golab, Lukasz ;
Karloff, Howard ;
Korn, Flip ;
Srivastava, Divesh .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (02) :1641-1644