CLAMS: Bringing Quality to Data Lakes

Cited by: 45
Authors
Farid, Mina [1]
Roatis, Alexandra [1]
Ilyas, Ihab F. [1]
Hoffmann, Hella-Franziska [1, 2, 3]
Chu, Xu [1]
Affiliations
[1] Univ Waterloo, Waterloo, ON, Canada
[2] Thomson Reuters, Philadelphia, PA 19130 USA
[3] Univ Waterloo, Waterloo, ON, Canada
Source
SIGMOD'16: PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2016
DOI
10.1145/2882903.2899391
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the increasing incentive of enterprises to ingest as much data as they can in what is commonly referred to as "data lakes", and with the recent development of multiple technologies to support this "load-first" paradigm, the new environment presents serious data management challenges. Among them, the assessment of data quality and cleaning large volumes of heterogeneous data sources become essential tasks in unveiling the value of big data. The coveted use of unstructured and semi-structured data in large volumes makes current data cleaning tools (primarily designed for relational data) not directly adoptable. We present CLAMS, a system to discover and enforce expressive integrity constraints from large amounts of lake data with very limited schema information (e.g., represented as RDF triples). This demonstration shows how CLAMS is able to discover the constraints and the schemas they are defined on simultaneously. CLAMS also introduces a scale-out solution to efficiently detect errors in the raw data. CLAMS interacts with human experts to both validate the discovered constraints and to suggest data repairs. CLAMS has been deployed in a real large-scale enterprise data lake and was experimented with a real data set of 1.2 billion triples. It has been able to spot multiple obscure data inconsistencies and errors early in the data processing stack, providing huge value to the enterprise.
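The kind of check the abstract describes, enforcing an integrity constraint over data stored as RDF triples, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from CLAMS: the toy triples, the predicate names `worksIn` and `managedBy`, the helper names `pivot` and `violations`, and the example constraint ("two subjects with the same `worksIn` value must not have different `managedBy` values"). The pairwise comparison is a naive in-memory loop, not the scale-out detection the system uses.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy triples (subject, predicate, object); not from the CLAMS data set.
TRIPLES = [
    ("emp1", "worksIn", "Sales"),
    ("emp1", "managedBy", "alice"),
    ("emp2", "worksIn", "Sales"),
    ("emp2", "managedBy", "bob"),
    ("emp3", "worksIn", "Legal"),
    ("emp3", "managedBy", "carol"),
]


def pivot(triples, predicates):
    """Group triples by subject into one row per subject over the given predicates.

    Assumes at most one object per (subject, predicate); a later value
    overwrites an earlier one in this sketch.
    """
    rows = defaultdict(dict)
    for s, p, o in triples:
        if p in predicates:
            rows[s][p] = o
    # Keep only subjects that have a value for every requested predicate.
    return {s: r for s, r in rows.items() if len(r) == len(predicates)}


def violations(rows, lhs, rhs):
    """Return pairs of subjects that agree on `lhs` but differ on `rhs`."""
    out = []
    for a, b in combinations(sorted(rows), 2):
        if rows[a][lhs] == rows[b][lhs] and rows[a][rhs] != rows[b][rhs]:
            out.append((a, b))
    return out


rows = pivot(TRIPLES, {"worksIn", "managedBy"})
found = violations(rows, "worksIn", "managedBy")
print(found)
```

In this sketch the constraint is fixed by hand; the abstract states that CLAMS discovers both the constraints and the schemas they are defined on, which this example does not attempt.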
Pages: 2089-2092
Number of pages: 4