A Fine-Grained Distribution Approach for ETL Processes in Big Data Environments

Cited by: 15
Authors
Bala, Mahfoud [1]
Boussaid, Omar [2]
Alimazighi, Zaia [3]
Affiliations
[1] Saad Dahleb Univ, Dept Informat, Blida 1, Blida, Algeria
[2] Univ Lyon 2, Lyon, France
[3] USTHB, Dept Informat, Algiers, Algeria
Keywords
Data Warehousing; ETL; Parallel and Distributed Processing; Big Data; MapReduce; Model
DOI
10.1016/j.datak.2017.08.003
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Among the so-called "4Vs" (volume, velocity, variety, and veracity) that characterize the complexity of Big Data, this paper focuses on volume, with the aim of ensuring good performance for Extracting-Transforming-Loading (ETL) processes. We propose a new fine-grained parallelization/distribution approach for populating the Data Warehouse (DW). Unlike prior approaches, which distribute the ETL process only at a coarse-grained level of processing, our approach supports parallelization/distribution at the process, functionality, and elementary-function levels. An ETL process is described in terms of its core functionalities, which can run on a cluster of computers according to the MapReduce (MR) paradigm. The approach thereby allows the ETL process to be distributed at three levels: the "process" level for coarse-grained distribution, and the "functionality" and "elementary functions" levels for fine-grained distribution. Our performance analysis reveals that employing 25 to 38 parallel tasks enables the approach to speed up the ETL process by up to 33%, with the improvement rate being linear.
Pages: 114-136
Page count: 23