Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms

Cited by: 6
Authors
Zhu, Guanghui [1]
Wang, Qian [1]
Tang, Qiwei [1]
Gu, Rong [1]
Yuan, Chunfeng [1]
Huang, Yihua [1]
Affiliations
[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210008, Jiangsu, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Distributed databases; Scalability; Remuneration; Lattices; Distributed algorithms; Switches; Query processing; Functional dependency discovery; distributed computing; data-parallel algorithms; ALGORITHM;
DOI
10.1109/TPDS.2019.2925014
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Functional dependencies (FDs) play an important role in many data management tasks such as schema normalization, data cleaning, and query optimization. Meanwhile, application demand for efficient FD discovery on large-scale datasets keeps increasing. Unfortunately, because of their huge runtime and memory overhead, the existing single-machine FD discovery algorithms are inefficient on large-scale datasets. Recently, distributed data-parallel computing has become the de facto standard for large-scale data processing. However, it is challenging to design an efficient distributed FD discovery algorithm. In this paper, we present SmartFD, an efficient and scalable algorithm for distributed FD discovery. First, we propose a novel attribute sorting-based algorithm framework. Next, to discover all the FDs grouped by a given attribute, we propose an efficient distributed algorithm, Attribute-centric Functional Dependency Discovery (AFDD). In AFDD, we design a Fast Sampling and Early Aggregation (FSEA) mechanism to improve the efficiency of distributed sampling, and we propose a memory-efficient index-based method for distributed FD validation. Moreover, AFDD employs an attribute-parallel method to accelerate the pruning and generation of candidate FDs. Furthermore, we propose an adaptive strategy for switching between distributed sampling and distributed validation, based on a unified time-based efficiency metric. We also employ a distributed probing-based method to make the switching strategy more accurate. Experimental results on Apache Spark show that SmartFD outperforms the state-of-the-art single-machine algorithm HyFD and the existing distributed algorithm HFDD, with speedups of 3.2× to 44.9× and 2.5× to 455.7×, respectively. Moreover, SmartFD achieves good row scalability and column scalability. Additionally, SmartFD has sub-linear node scalability.
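For readers unfamiliar with the validation step mentioned in the abstract, the following is a minimal single-machine sketch of what it means to check one candidate FD X -> A against a table. It is not SmartFD, AFDD, or the index-based method from the paper; the function name `fd_holds`, the dict-per-row layout, and the sample data are assumptions made only for illustration.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on rows.

    rows: list of dicts mapping attribute name to value (hypothetical layout).
    lhs:  list of attribute names on the left-hand side.
    rhs:  a single attribute name on the right-hand side.
    """
    seen = {}  # LHS value combination -> the RHS value first observed with it
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = row[rhs]
        # A violation is two rows that agree on lhs but differ on rhs.
        if key in seen and seen[key] != value:
            return False
        seen.setdefault(key, value)
    return True


# Hypothetical sample data: zip determines city, but city does not determine zip.
rows = [
    {"zip": "210008", "city": "Nanjing", "name": "A"},
    {"zip": "210008", "city": "Nanjing", "name": "B"},
    {"zip": "210023", "city": "Nanjing", "name": "C"},
]
zip_to_city = fd_holds(rows, ["zip"], "city")
city_to_zip = fd_holds(rows, ["city"], "zip")
```

Discovery has to run checks like this over a candidate space that grows exponentially with the number of attributes, which is why the abstract emphasizes sampling, pruning, and distributing the work.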
Pages: 2663-2676
Number of pages: 14