Scalable Load Balancing for MapReduce-based Record Linkage

Cited by: 0
Authors
Yan, Wei [1 ]
Xue, Yuan [1 ]
Malin, Bradley [2 ]
Affiliations
[1] Vanderbilt Univ, Dept Elect Engn & Comp Sci, 221 Kirkland Hall, Nashville, TN 37235 USA
[2] Vanderbilt Univ, Dept Biomed Informat, Nashville, TN USA
Source
2013 IEEE 32ND INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC) | 2013
Funding
National Institutes of Health (USA);
Keywords
Record Linkage; MapReduce; Scalability; Load Balance;
DOI
Not available
Chinese Library Classification
TN [Electronic technology, communication technology];
Discipline classification code
0809 ;
Abstract
Recent research has introduced load balancing schemes that are aware of the input data distribution (i.e., the data profile) to mitigate data skew and fully exploit the parallel capability of the MapReduce framework in support of record linkage. However, existing solutions face a significant scalability issue when applied to massive data sets with millions or billions of blocks (a basic unit in record linkage), because their data profiles cannot be maintained both precisely and efficiently. The goal of this paper is to introduce a profiling method based on the notion of a sketch, which provides a compact, scalable way to maintain block size statistics. In addition, we propose two load balancing algorithms that work over sketch-based profiles while solving the data skew problem associated with record linkage. We provide a theoretical analysis and extensive experiments (using Hadoop), with real and controlled synthetic data sets, to illustrate the effectiveness of our solution. The experimental results show that our load balancing algorithms can decrease the overall job completion time by 71.56% and 70.73% relative to the default settings in Hadoop on a set of DBLP data sets containing 2.5 to 50.4 million records.
Pages: 10