PHiDJ: Parallel Similarity Self-Join for High-Dimensional Vector Data with MapReduce

被引:0
作者
Fries, Sergej [1 ]
Boden, Brigitte [1 ]
Stepien, Grzegorz [1 ]
Seidl, Thomas [1 ]
机构
[1] Rhein Westfal TH Aachen, Dept Comp Sci 9, D-52056 Aachen, Germany
来源
2014 IEEE 30TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE) | 2014年
关键词
ALGORITHM;
D O I
暂无
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Join processing on large-scale vector data is an important problem in many applications, as vectors are a common representation for various data types. Especially, several data analysis tasks like near duplicate detection, density-based clustering or data cleaning are based on similarity self-joins, which are a special type of join. For huge data sets, MapReduce proved to be a suitable, error-tolerant framework for parallel join algorithms. Recent approaches exploit the vector-space properties for low-dimensional vector data for an efficient join computation. However, so far no parallel similarity self-join approaches aiming at high-dimensional vector data were proposed. In this work we propose the novel similarity self-join algorithm PHiDJ (Parallel High-Dimensional Join) for the MapReduce framework. PHiDJ is well suited for medium to high-dimensional data and exploits multiple filter techniques for reducing communication and computational costs. We provide a solution for efficient join computation for skewed distributed data. Our experimental evaluation on medium- to high-dimensional data shows that our approach outperforms existing techniques.
引用
收藏
页码:796 / 807
页数:12
相关论文
共 26 条
  • [1] Afrati F.N., 2012, ICDE
  • [2] Massively Parallel Data Analysis with PACTs on Nephele
    Alexandrov, Alexander
    Heimel, Max
    Markl, Volker
    Battre, Dominic
    Hueske, Fabian
    Nijkamp, Erik
    Ewen, Stephan
    Kao, Odej
    Warneke, Daniel
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (02): : 1625 - 1628
  • [3] [Anonymous], 2010, EDBT, DOI [DOI 10.1145/1739041.1739056, 10.1145/1739041.1739056]
  • [4] [Anonymous], 2009, Proceedings of the VLDB Endowment
  • [5] [Anonymous], 2010, P ACM SIGMOD INT C M, DOI DOI 10.1145/1807167.1807273
  • [6] [Anonymous], 1996, ACM SIGMOD RECORD
  • [7] Baraglia Ranieri, 2010, Proceedings 2010 10th IEEE International Conference on Data Mining (ICDM 2010), P731, DOI 10.1109/ICDM.2010.70
  • [8] Beecks Christian., 2010, ACM International Conference on Image and Video Retrieval, P438, DOI [10.1145/1816041.1816105, DOI 10.1145/1816041.1816105]
  • [9] Böhm C, 2001, SIGMOD RECORD, V30, P379, DOI 10.1145/376284.375714
  • [10] Date C.J., 2006, RELATIONAL DATABASE