HEADS-JOIN: Efficient Earth Mover's Distance Similarity Joins on Hadoop

被引:16
作者
Huang, Jin [1 ]
Zhang, Rui [1 ]
Buyya, Rajkumar [1 ]
Chen, Jian [2 ]
Wu, Yongwei [3 ]
机构
[1] Univ Melbourne, Dept Informat & Comp Syst, Melbourne, Vic 3010, Australia
[2] S China Univ Technol, Sch Software Engn, Guangzhou 510641, Guangdong, Peoples R China
[3] Tsinghua Univ, Tsinghua Natl Lab Informat Sci & Technol, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
基金
澳大利亚研究理事会;
关键词
Earth Mover's distance; similarity join; MapReduce; bulk synchronous parallel;
D O I
10.1109/TPDS.2015.2462354
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
The Earth Mover's Distance (EMD) similarity join has a number of important applications such as near duplicate image retrieval and distributed based pattern analysis. However, the computational cost of EMD is super cubic and consequently the EMD similarity join operation is prohibitive for datasets of even medium size. We propose to employ the Hadoop platform to speed up the operation. Simply porting the state-of-the-art metric distance similarity join algorithms to Hadoop results in inefficiency because they involve excessive distance computations and are vulnerable to skewed data distributions. We propose a novel framework, named HEADS-JOIN, which transforms data into the space of EMD lower bounds and performs pruning and partitioning at a low cost because computing these EMD lower bounds has constant or linear complexity. We investigate both range and top-k joins, and design efficient algorithms on three popular Hadoop computation paradigms, i.e., MapReduce, Bulk Synchronous Parallel, and Spark. We conduct extensive experiments on both real and synthetic datasets. The results show that HEADS-JOIN outperforms the state-of-the-art metric similarity join technique, i.e., Quickjoin, by up to an order of magnitude and scales out well.
引用
收藏
页码:1660 / 1673
页数:14
相关论文
共 30 条
  • [1] A motion-aware approach to continuous retrieval of 3D objects
    Ali, Mohammed Eunus
    Zhang, Rui
    Tanin, Egemen
    Kulik, Lars
    [J]. 2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, : 843 - +
  • [2] [Anonymous], 2010, PROC ACM SIGMM INT C
  • [3] [Anonymous], 2012, NSDI
  • [4] [Anonymous], 2010, P ACM SIGMOD INT C M, DOI DOI 10.1145/1807167.1807273
  • [5] [Anonymous], 1997, The earth mover's distance: Lower bounds and invariance under translation
  • [6] Applegate D., 2011, P 17 ACM SIGKDD INT, P636, DOI DOI 10.1145/2020408.2020508
  • [7] Assent I., 2006, ICDE Conference, P11
  • [8] Bamha M., 2003, PARALLEL PROCESSING, V13, P317, DOI DOI 10.1142/S0129626403001306
  • [9] Bertin-Mahieux T., 2011, P 12 INT SOC MUS INF, V2, P10, DOI DOI 10.7916/D8NZ8J07
  • [10] Culler D. E., 1993, PROTABILITY PERFORMA