K Nearest Neighbour Joins for Big Data on MapReduce: A Theoretical and Experimental Analysis

被引:33
作者
Song, Ge [1 ]
Rochas, Justine [4 ]
El Beze, Lea [4 ]
Huet, Fabrice [4 ]
Magoules, Frederic [2 ,3 ]
机构
[1] Univ Paris Saclay, CentraleSupelec, Dept Math & Comp Sci, St Aubin, France
[2] Univ Paris Saclay, CentraleSupelec, Dept Math, St Aubin, France
[3] Univ Paris Saclay, CentraleSupelec, Dept Comp Sci, St Aubin, France
[4] Univ Nice Sophia Antipolis, CNRS, I3S, UMR 7271, F-06900 Sophia Antipolis, France
关键词
kNN; mapreduce; performance evaluation;
D O I
10.1109/TKDE.2016.2562627
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a computational intensive task with a large range of applications such as knowledge discovery or data mining. However, as the volume and the dimension of data increase, only distributed approaches can perform such costly operation in a reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model because it is suitable for distributed large scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the different existing approaches for computing kNN on MapReduce, first theoretically, and then by performing an extensive experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning, and computation. We then analyze each step from load balancing, accuracy, and complexity aspects. Experiments in this paper use a variety of datasets, and analyze the impact of data volume, data dimension, and the value of k from many perspectives like time and space complexity, and accuracy. The experimental part brings new advantages and shortcomings that are discussed for each algorithm. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.
引用
收藏
页码:2376 / 2392
页数:17
相关论文
共 34 条
  • [1] Agrawal R., 1993, Foundations of Data Organization and Algorithms. 4th International Conference. FODO '93 Proceedings, P69
  • [2] Andreica M. I., 2013, ACTA U APULENSIS MAT, V29, P131
  • [3] [Anonymous], 1997, P 1997 ACM SIGMOD IN
  • [4] [Anonymous], 2008, 11 INT WORKSHOP WEB
  • [5] [Anonymous], 2010, LARGE SCALE DISTRIB
  • [6] [Anonymous], 1998, GEOINFORMATICA, DOI DOI 10.1023/A:1009755931056
  • [7] GPU-FS-kNN: A Software Tool for Fast and Scalable kNN Computation Using GPUs
    Arefin, Ahmed Shamsul
    Riveros, Carlos
    Berretta, Regina
    Moscato, Pablo
    [J]. PLOS ONE, 2012, 7 (08):
  • [8] Collaborative Personalized Top-k Processing
    Bai, Xiao
    Guerraoui, Rachid
    Kermarrec, Anne-Marie
    Leroy, Vincent
    [J]. ACM TRANSACTIONS ON DATABASE SYSTEMS, 2011, 36 (04):
  • [9] SURF: Speeded up robust features
    Bay, Herbert
    Tuytelaars, Tinne
    Van Gool, Luc
    [J]. COMPUTER VISION - ECCV 2006 , PT 1, PROCEEDINGS, 2006, 3951 : 404 - 417
  • [10] MULTIDIMENSIONAL BINARY SEARCH TREES USED FOR ASSOCIATIVE SEARCHING
    BENTLEY, JL
    [J]. COMMUNICATIONS OF THE ACM, 1975, 18 (09) : 509 - 517