Distributed Similarity Join Over Data Streams Based on Earth Mover's Distance

被引:0
|
作者
Xu J. [1 ,2 ,3 ]
Song C. [1 ]
Lv P. [1 ,2 ,3 ]
Li T.-S. [1 ,2 ]
机构
[1] College of Computer, Electronics and Information, Guangxi University, Nanning
[2] Guangxi Colleges and Universities Key Laboratory of Parallel and Distributed Computing, Nanning
[3] Guangxi Key Laboratory of Multimedia Communications Network Technology, Nanning
来源
Jisuanji Xuebao/Chinese Journal of Computers | 2019年 / 42卷 / 08期
基金
中国国家自然科学基金;
关键词
Apache Storm framework; Data locality; Data stream; Earth mover's distance; Similarity join;
D O I
10.11897/SP.J.1016.2019.01779
中图分类号
学科分类号
摘要
With the continuous progress of data acquisition equipment and the rapid development of data acquisition technologies, analyzing and mining quick-generated data streams from practical applications become an urgent problem to solve. Similarity join over data streams, which returns all similar pairs of tuples from two data streams, is an important operation for analyzing and mining data streams. Compared with Lp norm distance functions, such as Manhattan distance and Euclidean distance, Earth Mover's Distance(EMD) is more robust in quantifying the similarities of histogram-representative tuples and thus receives widespread attentions. EMD is widely applied to solve many important application problems, such as content-based image retrieval, duplicated image recognition and video object tracking. However, the computation of EMD owns cubic time complexity, which hampers the application of EMD in solving similarity join problem over data streams. Based on Apache Storm, which is an open source distributed parallel data stream processing framework, we design and implement a distributed similarity join technique for data streams based on EMD, named EMD-DDSJ. This technique partitions histogram-representative tuples in data streams based on their similarities in terms of EMD and thus maintains good data locality on each join-computing node. It is the good data locality that enhances the filtering effect of joining algorithm to unpromising EMD calculations between dissimilar histogram-representative tuple pairs, and therefore improves the execution efficiency of join-computing nodes. On the basis of the cost model of join-computing nodes, we also propose an effective load balancing strategyon the strength of feedback messages sent by join-computing nodes, which improves the overall performance of the EMD-DDSJ technique. Experimental results on physical-world datasets show the efficiency and good scalability of the proposed EMD-DDSJ technique. Particularly, the processing throughput of EMD-DDSJ technique is at most 1.4 times higher than that of the best related technique and the average tuple processing delay of EMD-DDSJ technique decreases up to 44% compared with the best related technique. Besides, with the growth of the similarity threshold or the size of sliding window, the throughput improvement ratio of EMD-DDSJ technique compared to the related technique will be further increased. © 2019, Science Press. All right reserved.
引用
收藏
页码:1779 / 1796
页数:17
相关论文
共 25 条
  • [1] Liu P., Guo J.M., Chamnongthai K., Prasetyo H., Fusion of color histogram and LBP-based features for texture image retrieval and classification, Information Sciences, 390, pp. 95-111, (2017)
  • [2] Deshpande A., Guestrin C., Madden S.R., Et al., Model-based approximate querying in sensor networks, The VLDB Journal, 14, 4, pp. 417-443, (2005)
  • [3] Lian X., Chen L., Efficient join processing on uncertain data streams, IEEE Transactions on Knowledge& Data Engineering, 23, 11, pp. 1718-1734, (2009)
  • [4] Rubner Y., Tomasi C., Guibas L.J., The earth mover's distance as a metric for image retrieval, International Journal of Computer Vision, 40, 2, pp. 99-121, (2000)
  • [5] Nirmala S.G., Kiran K.P., Earth mover's distance-based CBIR using adaptive regularized Kernel fuzzy C-means method of liver cirrhosis histopathological segmentation, International Journal of Signal and Imaging Systems Engineering, 10, 1-2, (2017)
  • [6] Xu J., Bai Q., Gu Y., Tung A.K., Eudemon: A system for online video frame copy detection by earth mover's distance, Proceedings of the 28th International Conference on Data Engineering, pp. 1233-1236, (2012)
  • [7] Xu J., Lei B., Gu Y., Winslett M., Et al., Efficient similarity join based on earth mover's distance using MapReduce, IEEE Transactions on Knowledge& Data Engineering, 27, 8, pp. 2148-2162, (2015)
  • [8] Zhao Q., Yang Z., Tao H., Differential earth mover's distance with its applications to visual tracking, IEEE Transactions on Pattern Analysis & Machine Intelligence, 32, 2, pp. 274-287, (2010)
  • [9] Huang J., Zhang R., Buyya R., Chen J., MELODY-JOIN: Efficient earth mover's distance similarity joins using MapReduce, Proceedings of the 28th International Conference on Data Engineering, pp. 808-819, (2014)
  • [10] Wichterich M., Assent I., Kranen P., Et al., Efficient EMD-based similarity search in multimedia databases via flexible dimensionality reduction, Proceedings of the 2008 ACM SIGMOD/PODS Conference, pp. 199-212, (2008)