MapReduce-based entity matching with multiple blocking functions

被引:3
作者
Jin, Cheqing [1 ]
Chen, Jie [1 ]
Liu, Huiping [1 ]
机构
[1] East China Normal Univ, Inst Data Sci & Engn, Sch Comp Sci & Software Engn, Shanghai 200062, Peoples R China
基金
中国国家自然科学基金;
关键词
entity matching; MapReduce; load balancing; pair deduplication; RECORD LINKAGE;
D O I
10.1007/s11704-016-5346-4
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Entity matching that aims at finding some records belonging to the same real-world objects has been studied for decades. In order to avoid verifying every pair of records in a massive data set, a common method, known as the blocking-based method, tends to select a small proportion of record pairs for verification with a far lower cost than O(n (2)), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical since much more matching records can be found in this way, so that the quality of the query result can be improved significantly. It is popular to use the MapReduce (MR) framework to improve the performance and the scalability of some complicated queries by running a lot of map (/reduce) tasks in parallel. However, entity matching upon the MapReduce framework is non-trivial due to two inevitable challenges: load balancing and pair deduplication. In this paper, we propose a novel solution, called MrEm, to handle these challenges with the support of multiple blocking functions. Although the existing work can deal with load balancing and pair deduplication respectively, it still cannot deal with both challenges at the same time. Theoretical analysis and experimental results upon real and synthetic data sets illustrate the high effectiveness and efficiency of our proposed solutions.
引用
收藏
页码:895 / 911
页数:17
相关论文
共 30 条
[1]  
[Anonymous], 2002, P 8 ACM SIGKDD INT C, DOI DOI 10.1145/775047.775116
[2]  
[Anonymous], 2003, P 9 ACM SIGKDD INT C, DOI DOI 10.1145/956750.956759
[3]  
[Anonymous], 2012, Hadoop: The definitive guide
[4]  
Baxter R., 2003, ACM SIGKDD 03 WORKSH, P25, DOI DOI 10.1007/978-3-319-11257-2
[5]   Swoosh: a generic approach to entity resolution [J].
Benjelloun, Omar ;
Garcia-Molina, Hector ;
Menestrina, David ;
Su, Qi ;
Whang, Steven Euijong ;
Widom, Jennifer .
VLDB JOURNAL, 2009, 18 (01) :255-276
[6]  
Bilenko M, 2006, IEEE DATA MINING, P87
[8]   ClusterJoin: A Similarity Joins Framework using Map-Reduce [J].
Das Sarma, Akash ;
He, Yeye ;
Chaudhuri, Surajit .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 7 (12) :1059-1070
[9]   Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters [J].
De Vries, Timothy ;
Ke, Hui ;
Chawla, Sanjay ;
Christen, Peter .
ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2011, 5 (02)
[10]   Mapreduce: Simplified data processing on large clusters [J].
Dean, Jeffrey ;
Ghemawat, Sanjay .
COMMUNICATIONS OF THE ACM, 2008, 51 (01) :107-113