Set similarity join on massive probabilistic data using MapReduce

Cited by: 0
Authors
Youzhong Ma
Xiaofeng Meng
Affiliation
[1] Renmin University of China, School of Information
Source
Distributed and Parallel Databases | 2014 / Vol. 32
Keywords
Set similarity join; MapReduce; Probabilistic data
DOI
Not available
Abstract
This paper addresses set similarity join on massive probabilistic data using MapReduce, a problem for which no efficient approach currently exists. MapReduce is a popular paradigm for processing large volumes of data, and we propose two MapReduce-based approaches: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of existence probabilities to filter out, directly on the Map task side, probabilistic sets that have no chance of being similar to any other probabilistic set. Hadoop Join by Reduce Side Pruning applies a probability-sum-based pruning principle and a probability-upper-bound-based pruning principle to reduce the number of candidate pairs on the Reduce task side, which saves comparison cost. Building on these approaches, we propose a hybrid solution that employs both Map-side and Reduce-side pruning. Finally, we implemented the approaches on Hadoop-0.20.2 and performed comprehensive experiments to evaluate their performance; we also measured the speedup ratio relative to the naive method, Block Nested Loop Join. The experimental results show that our approaches perform much better than Block Nested Loop Join and scale well. To the best of our knowledge, this is the first work to address set similarity join on massive probabilistic data using the MapReduce paradigm, and the proposed approaches provide a new way to process massive probabilistic data.
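The abstract does not give the exact pruning rule, so the following is only an illustrative sketch of what a Map-side filter based on the sum of existence probabilities could look like. It assumes a model not stated in this record: each element of a probabilistic set exists independently with a given probability, and a pair qualifies only if its probability of being similar reaches a threshold alpha. Under that assumption, a pair can be similar only if both sets are non-empty, and by the union bound the probability that a set is non-empty is at most the sum of its existence probabilities, so any set whose sum falls below alpha can be dropped. The names ProbSet, existence_sum, map_side_prune, and alpha are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical representation: element -> existence probability.
# Elements are assumed to exist independently (an assumption, not from the paper).
ProbSet = Dict[str, float]


def existence_sum(s: ProbSet) -> float:
    """Sum of the existence probabilities of all elements in the set."""
    return sum(s.values())


def map_side_prune(
    records: Iterable[Tuple[str, ProbSet]], alpha: float
) -> Iterator[Tuple[str, ProbSet]]:
    """Emit only the sets that could still take part in a qualifying pair.

    Union bound: P(set is non-empty) <= existence_sum(set). A pair can be
    similar only if both sets are non-empty, so a set whose sum is below
    alpha cannot reach similarity probability alpha with any partner.
    """
    for set_id, s in records:
        if existence_sum(s) >= alpha:
            yield set_id, s


if __name__ == "__main__":
    data = [
        ("a", {"x": 0.1, "y": 0.2}),  # sum is about 0.3, pruned at alpha = 0.5
        ("b", {"x": 0.9, "z": 0.8}),  # sum is about 1.7, kept
    ]
    for set_id, _ in map_side_prune(data, alpha=0.5):
        print(set_id)
```

In a Hadoop job a check of this kind would run inside the map function before any record is emitted, so pruned sets would never be shuffled to the Reduce side.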
Pages: 447-464
Page count: 17