A Scalable Similarity Join Algorithm Based on MapReduce and LSH

被引:0
作者
Sébastien Rivault
Mostafa Bamha
Sébastien Limet
Sophie Robert
机构
[1] Université Orléans,
[2] INSA Centre Val de Loire,undefined
[3] LIFO,undefined
[4] EA,undefined
来源
International Journal of Parallel Programming | 2022年 / 50卷
关键词
Similarity join operations; Local sensitive hashing (LSH); MapReduce model; Data skew; Hadoop framework;
D O I
暂无
中图分类号
学科分类号
摘要
Similarity joins are recognized to be among the most useful data processing and analysis operations. A similarity join is used to retrieve all data pairs whose distances are smaller than a predefined threshold λ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document}. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized local sensitive hashing keys redistribution approach are used to balance load among processing nodes while reducing communications and computations to almost all relevant data by using distributed histograms. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large scale systems, during all stages of similarity join computations. These performances have been confirmed by a series of experiments using the Fréchet distance on large datasets of trajectories from real world and synthetic data benchmarks.
引用
收藏
页码:360 / 380
页数:20
相关论文
共 35 条
  • [1] Alt H(1995)Computing the fréchet distance between two polygonal curves Int. J. Comput. Geomet. Appl. 05 75-91
  • [2] Godau M(2003)Pipelining a skew-insensitive parallel join algorithm Parallel Process. Lett. 13 317-328
  • [3] Bamha M(2017)Four soviets walk the dog: Improved bounds for computing the fréchet distance Discret. Comput. Geomet. 58 180-216
  • [4] Exbrayat M(2008)Mapreduce: simplified data processing on large clusters Commun. ACM 51 107-113
  • [5] Buchin K(2012)Approximating the fréchet distance for realistic curves in near linear time Discret. Comput. Geomet. 48 94-127
  • [6] Buchin M(1950)Human behaviour and the principle of least effort Econ. J. 60 808-810
  • [7] Meulemans W(2015)Towards scalability and data skew handling in groupby-joins using mapreduce model Procedia Comput. Sci. 51 70-79
  • [8] Mulzer W(2014)Handling data-skew effects in join operations using mapreduce Procedia Comput. Sci. 29 145-158
  • [9] Dean J(2017)Visual analytics of delays and interaction in movement data Int. J. Geogr. Inf. Sci. 31 320-345
  • [10] Ghemawat S(2012)V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors Proc. VLDB Endow. 5 704-715