Profiling relational data: a survey

Cited by: 154
Authors
Abedjan, Ziawasch [1]
Golab, Lukasz [2]
Naumann, Felix [3]
Affiliations
[1] MIT CSAIL, Cambridge, MA USA
[2] Univ Waterloo, Waterloo, ON N2L 3G1, Canada
[3] Hasso Plattner Inst, Potsdam, Germany
Keywords
FUNCTIONAL-DEPENDENCIES; DATA QUALITY; INCLUSION DEPENDENCIES; EFFICIENT ALGORITHM; DISCOVERY; ATTRIBUTE; PATTERNS; SYNOPSES; KEYS
DOI
10.1007/s00778-015-0389-y
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
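The profiling tasks named in the abstract can be illustrated with a minimal sketch. It is not taken from the survey: the function names (`value_pattern`, `profile_column`, `is_unique`, `holds_fd`, `holds_ind`), the row-as-dictionary representation, and the pattern alphabet (`9` for digits, `A` for letters) are assumptions made for illustration, and the checks are naive full scans rather than the discovery algorithms the survey reviews.

```python
from collections import Counter
import re


def value_pattern(text):
    """Abstract a value into a pattern: digits become 9, letters become A (assumed alphabet)."""
    text = re.sub(r"[0-9]", "9", text)
    return re.sub(r"[A-Za-z]", "A", text)


def profile_column(values):
    """Single-column statistics: null count, distinct count, most frequent patterns."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(str(v)) for v in non_null)
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "top_patterns": patterns.most_common(3),
    }


def is_unique(rows, columns):
    """Check whether a column combination is unique, i.e. no two rows agree on it."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) == len(keys)


def holds_fd(rows, lhs, rhs):
    """Check a functional dependency lhs -> rhs: equal lhs values imply equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(key, val) != val:
            return False
    return True


def holds_ind(rows_a, col_a, rows_b, col_b):
    """Check an inclusion dependency: every value of A.col_a also appears in B.col_b."""
    return {r[col_a] for r in rows_a} <= {r[col_b] for r in rows_b}


# Hypothetical sample data, invented for this sketch
rows = [
    {"zip": "10115", "city": "Berlin", "name": "A"},
    {"zip": "10115", "city": "Berlin", "name": None},
    {"zip": "14482", "city": "Potsdam", "name": "B"},
]
cities = [{"city": "Berlin"}, {"city": "Potsdam"}, {"city": "Waterloo"}]

zip_profile = profile_column([r["zip"] for r in rows])
name_profile = profile_column([r["name"] for r in rows])
```

On this sample, `holds_fd(rows, ["zip"], ["city"])` holds while `is_unique(rows, ["zip"])` does not, which illustrates why multi-column metadata is treated separately from per-column statistics: each check needs to compare rows against each other rather than summarize one column in isolation.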
Pages: 557-581
Page count: 25