A Scalable Similarity Join Algorithm Based on MapReduce and LSH

被引:3
作者
Rivault, Sebastien [1 ]
Bamha, Mostafa [1 ]
Limet, Sebastien [1 ]
Robert, Sophie [1 ]
机构
[1] Univ Orleans, INSA Ctr Val Loire, EA, LIFO, F-4022 Orleans, France
关键词
Similarity join operations; Local sensitive hashing (LSH); MapReduce model; Data skew; Hadoop framework;
D O I
10.1007/s10766-022-00733-6
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Similarity joins are recognized to be among the most useful data processing and analysis operations. A similarity join is used to retrieve all data pairs whose distances are smaller than a predefined threshold 2. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized local sensitive hashing keys redistribution approach are used to balance load among processing nodes while reducing communications and computations to almost all relevant data by using distributed histograms. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large scale systems, during all stages of similarity join computations. These performances have been confirmed by a series of experiments using the Frechet distance on large datasets of trajectories from real world and synthetic data benchmarks.
引用
收藏
页码:360 / 380
页数:21
相关论文
共 23 条
  • [1] COMPUTING THE FRECHET DISTANCE BETWEEN 2 POLYGONAL CURVES
    ALT, H
    GODAU, M
    [J]. INTERNATIONAL JOURNAL OF COMPUTATIONAL GEOMETRY & APPLICATIONS, 1995, 5 (1-2) : 75 - 91
  • [2] A fast implementation of near neighbors queries for Frechet distance (GIS Cup)
    Baldus, Julian
    Bringmann, Karl
    [J]. 25TH ACM SIGSPATIAL INTERNATIONAL CONFERENCE ON ADVANCES IN GEOGRAPHIC INFORMATION SYSTEMS (ACM SIGSPATIAL GIS 2017), 2017,
  • [3] Bamha M., 2005, LNCS, V3588
  • [4] Bamha M., 2003, PARALLEL PROCESSING, V13, P317, DOI DOI 10.1142/S0129626403001306
  • [5] Blanas S., 2010, P ACM SIGMOD INT C M, P975, DOI [DOI 10.1145/1807167.1807273, 10]
  • [6] Why walking the dog takes time: Frechet distance has no strongly subquadratic algorithms unless SETH fails
    Bringmann, Karl
    [J]. 2014 55TH ANNUAL IEEE SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS 2014), 2014, : 661 - 670
  • [7] Four Soviets Walk the Dog: Improved Bounds for Computing the Frechet Distance
    Buchin, Kevin
    Buchin, Maike
    Meulemans, Wouter
    Mulzer, Wolfgang
    [J]. DISCRETE & COMPUTATIONAL GEOMETRY, 2017, 58 (01) : 180 - 216
  • [8] FRESH: Frechet Similarity with Hashing
    Ceccarello, Matteo
    Driemel, Anne
    Silvestri, Francesco
    [J]. ALGORITHMS AND DATA STRUCTURES, WADS 2019, 2019, 11646 : 254 - 268
  • [9] Mapreduce: Simplified data processing on large clusters
    Dean, Jeffrey
    Ghemawat, Sanjay
    [J]. COMMUNICATIONS OF THE ACM, 2008, 51 (01) : 107 - 113
  • [10] Driemel A., 2017, P INT S COMP GEOM, DOI [10.4230/LIPIcs.SoCG.2017.37, DOI 10.4230/LIPICS.SOCG.2017.37]