GPU-based similarity metrics computation and machine learning approaches for string similarity evaluation in large datasets

Authors
Aurel Baloi
Bogdan Belean
Flaviu Turcu
Daniel Peptenatu
Affiliations
[1] University of Bucharest, Research Center for Integrated Analysis and Territorial Management
[2] Intergraph Computer Services, Department
[3] National Institute for Research and Development of Isotopic and Molecular Technologies, Center for Research and Advanced Technologies for Alternative Energie
Source
Soft Computing | 2024 / Vol. 28
Keywords
String similarity score; String matching; Parallel computation; GPU; CUDA kernel
Abstract
The digital era brings, on the one hand, massive amounts of available data and, on the other, the need for parallel computing architectures that can process those data efficiently. String similarity evaluation is a processing task applied to large data volumes and is commonly performed by applications such as search engines, biomedical data analysis, and software tools that defend against viruses, spyware, or spam. String similarities are also used in the music industry to match playlist records against repertory records composed of song titles, performing artists, and producers' names, with the aim of ensuring copyright protection for broadcast material. This paper proposes a novel GPU-based parallel implementation of the Jaro–Winkler string similarity metric, which is widely used for matching strings over large datasets. The implementation is applied in the music industry to match playlists of over 100k records against a repertory of over 1 million rights-owner records. Global GPU memory stores multiple string lines representing repertory records, while each playlist string is compared against that raw data using the maximum number of available GPU threads and stride operations. The accuracy of Jaro–Winkler-based string matching is then increased using both an adaptive neural network guided by a novelty detection classifier (aNN) and a multiple-features neural network (MF-NN). The aNN approach yielded an accuracy of 92%, while the MF-NN approach achieved 99% at the cost of increased computational complexity. Timing and computational complexity are detailed for the proposed approaches and compared with both a general-purpose processor (CPU) implementation and state-of-the-art GPU approaches.
A speed-up factor of 21.6 was obtained for the GPU-based Jaro–Winkler implementation compared with the CPU one, whereas a factor of 3.72 was obtained compared with the existing GPU implementation of the string matching procedure based on the Levenshtein distance metric.
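For readers unfamiliar with the metric named in the abstract, the following is a minimal CPU-only sketch of the standard Jaro–Winkler score, followed by a plain-Python emulation of a strided work split across threads. It is not the authors' CUDA kernel. The function names, the prefix scale `p = 0.1`, the 4-character prefix cap, the `n_threads` parameter, and the sample strings are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted by the length of the common prefix (assumed p=0.1)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)


def strided_best_match(query: str, repertory: List[str],
                       n_threads: int = 4) -> Tuple[int, float]:
    """Emulate a strided split: 'thread' tid handles records tid, tid+n_threads, ..."""
    partial = []
    for tid in range(n_threads):  # each iteration stands in for one GPU thread
        best = (-1.0, -1)
        for idx in range(tid, len(repertory), n_threads):  # stride loop
            score = jaro_winkler(query, repertory[idx])
            if score > best[0]:
                best = (score, idx)
        partial.append(best)
    score, idx = max(partial)  # final reduction over per-thread results
    return idx, score


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # about 0.961
    repertory = ["yesterday the beatles", "let it be the beatles",
                 "imagine john lennon"]
    print(strided_best_match("yesterdya the beatles", repertory))
```

On a GPU the outer loop over `tid` would run concurrently, and the stride would typically be the total number of launched threads; here it runs sequentially only to show how the records are partitioned and then reduced to a single best match.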
Pages: 3465-3477
Number of pages: 12