Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture

被引:0
|
作者
Xu K. [1 ]
Nie T. [1 ]
Shen D. [1 ]
Kou Y. [1 ]
Yu G. [1 ]
机构
[1] School of Computer Science and Engineering, Northeastern University, Shenyang
来源
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2021年 / 58卷 / 03期
关键词
Filter-verification framework; GPU parallel processing; Heterogeneous architecture; Inverted index; Similarity join;
D O I
10.7544/issn1000-1239.2021.20190567
中图分类号
学科分类号
摘要
Similarity join is an important task in data cleaning, data integration and other fields, which has attracted extensive attention in recent years. With the increasing amount of data, the improvement of real-time processing requirement and the bottleneck of CPU performance improvement, the traditional serial algorithms of similarity join have been unable to meet the requirement of current big data processing. As a co-processor, GPU has achieved good acceleration results in machine learning and other fields in recent years. It is of great practical significance to study the parallel similarity join algorithms based on GPU. This paper proposes a parallel similarity join algorithm based on CPU-GPU heterogeneous architecture. Firstly, GPU is used to construct inverted index based on SoA (struct of arrays), which solves the problem of low efficiency of traditional index structure in parallel reading and writing. Then, to address the performance problem of serial algorithms, a parallel dual-length filtering algorithm based on filter-verification framework is proposed. Inverted index and prefix filtering algorithm are used to further improve the filtering performance. And in our approach, the calculation for exact similarity verification is performed by CPU to make full use of heterogeneous computing resources of CPU-GPU. Finally, experiments are carried out on several datasets. Compared with the serial similarity join algorithms, the results show that our proposed algorithms have better filtering performance and lower index generation time than existing algorithms, and also have better processing performance and higher speedup ratio on the similarity join. © 2021, Science Press. All right reserved.
引用
收藏
页码:598 / 608
页数:10
相关论文
共 17 条
  • [1] Li Chen, Lu Jiaheng, Lu Yiming, Et al., Efficient merging and filtering algorithms for approximate string searches, Proc of IEEE Int Conf on Data Engineering, pp. 257-266, (2008)
  • [2] Xiao Cuan, Wang Wei, Lin Xuemin, Ed-Join: An efficient algorithm for similarity joins with edit distance constraints, Proceedings of the VLDB Endowment, 1, 1, pp. 933-944, (2008)
  • [3] Li Guoliang, Deng Dong, Wang Jiannan, Et al., PASS-JOIN: A partition-based method for similarity joins, Proceedings of the VLDB Endowment, 5, 3, pp. 253-264, (2011)
  • [4] Rong Chuitian, Lu Wei, Wang Xiaoli, Et al., Efficient and scalable processing of string similarity join, IEEE Transactions on Knowledge and Data Engineering, 25, 10, pp. 2217-2230, (2013)
  • [5] Cui Jia, Wang Weiping, Meng Dan, Et al., Continuous similarity join on data streams, Proc of IEEE Int Conf on Parallel & Distributed Systems, pp. 552-559, (2015)
  • [6] Wang Jiaying, Yang Xiaochun, Wang Bin, Et al., LS-Join: Local similarity join on string collections, IEEE Transactions on Knowledge and Data Engineering, 29, 9, pp. 1-12, (2017)
  • [7] Zhao Weijie, Rusu F, Dong Bin, Et al., Similarity join over array data, Proc of the 2016 ACM SIGMOD Int Conf on Management of Data, pp. 2007-2022, (2016)
  • [8] Patil M, Shah R., Similarity joins for uncertain strings, Proc of the 2016 ACM SIGMOD Int Conf on Management of Data, pp. 551-562, (2014)
  • [9] Chen Gang, Yang Keyu, Chen Lu, Et al., Metric similarity joins using MapReduce, IEEE Transactions on Knowledge & Data Engineering, 29, 3, pp. 656-669, (2017)
  • [10] Deng Shizhuo, Xin Junchang, Nie Tiezheng, Et al., Big data similarity join processing based on prefix-suffix filtering, Journal of Frontiers of Computer Science and Technology, 11, 8, pp. 49-59, (2017)