GPU-based similarity metrics computation and machine learning approaches for string similarity evaluation in large datasets

Cited by: 0

Authors:

Aurel Baloi

Bogdan Belean

Flaviu Turcu

Daniel Peptenatu

Affiliations:

[1] University of Bucharest, Research Center for Integrated Analysis and Territorial Management

[2] Intergraph Computer Services, Department

[3] National Institute for Research and Development of Isotopic and Molecular Technologies, Center for Research and Advanced Technologies for Alternative Energie

Source:

Soft Computing | 2024 / Vol. 28

Keywords:

String similarity score; String matching; Parallel computation; GPU; CUDA kernel

DOI:

Not available

Abstract:

The digital era brings, on the one hand, massive amounts of available data and, on the other, the need for parallel computing architectures that can process it efficiently. String similarity evaluation is a task applied to large data volumes and is commonly performed by applications such as search engines, biomedical data analysis, and software tools that defend against viruses, spyware, or spam. String similarity is also used in the music industry to match playlist records against repertory records composed of song titles, performing artists, and producer names, with the aim of ensuring copyright protection for broadcast material. This paper proposes a novel GPU-based approach for the parallel computation of the Jaro–Winkler string similarity metric, which is widely used for matching strings over large datasets. The proposed implementation is applied in the music industry to match a playlist of over 100k records against a repertory containing over 1 million rights-owner records. GPU global memory stores multiple string lines representing repertory records, while each playlist string is compared with this raw data using the maximum number of available GPU threads and strided operations. The accuracy of Jaro–Winkler string matching is then increased using both an adaptive neural network guided by a novelty detection classifier (aNN) and a multiple-features neural network (MF-NN). The aNN approach yielded an accuracy of 92%, while the MF-NN approach achieved 99% at the cost of increased computational complexity. Timing and computational complexity are detailed for the proposed approaches and compared with both a general-purpose processor (CPU) implementation and state-of-the-art GPU approaches. A speed-up factor of 21.6 was obtained for the GPU-based Jaro–Winkler implementation relative to the CPU one, and a factor of 3.72 relative to an existing GPU implementation of string matching based on the Levenshtein distance.
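
To make the matching step concrete, the following is a minimal CPU-side sketch in Python of the standard Jaro–Winkler score, together with a sequential simulation of a strided work split over a repertory. It is not the authors' CUDA kernel: the function names (`jaro`, `jaro_winkler`, `best_match_strided`), the prefix parameters (scale 0.1, prefix length capped at 4), and the way simulated threads divide the records are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted by the length of the common prefix (assumed defaults)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1.0 - j)


def best_match_strided(query: str, repertory: List[str],
                       n_threads: int = 4) -> Tuple[float, int]:
    """Sequentially simulate n_threads workers, each visiting records
    tid, tid + n_threads, tid + 2 * n_threads, ... (a strided split),
    then reduce the per-worker bests to one (score, index) pair."""
    per_thread_best = []
    for tid in range(n_threads):
        best = (-1.0, -1)
        for idx in range(tid, len(repertory), n_threads):
            score = jaro_winkler(query, repertory[idx])
            if score > best[0]:
                best = (score, idx)
        per_thread_best.append(best)
    return max(per_thread_best)


if __name__ == "__main__":
    repertory = ["DIXON", "MARHTA", "JONES", "MARTHA"]
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    print(best_match_strided("MARTHA", repertory))
```

With these assumed defaults, `jaro_winkler("MARTHA", "MARHTA")` evaluates to about 0.9611. On a GPU, the outer loop over `tid` would run concurrently, one iteration per thread, which is why the strided indexing keeps every thread busy regardless of how many repertory records are loaded.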

Pages: 3465 - 3477

Page count: 12

50 records in total

[1] GPU-based similarity metrics computation and machine learning approaches for string similarity evaluation in large datasets
Baloi, Aurel
Belean, Bogdan
Turcu, Flaviu
Peptenatu, Daniel
SOFT COMPUTING, 2024, 28 (04) : 3465 - 3477
[2] GPU-Based Medical Visualization for Large Datasets
Zou, Hue
Lin, Fu
Han, Jie
Zhang, Wen
JOURNAL OF MEDICAL IMAGING AND HEALTH INFORMATICS, 2015, 5 (07) : 1467 - 1473
[3] fgssjoin: A GPU-based Algorithm for Set Similarity Joins
Quirino, Rafael D.
Junior, Sidney R.
Ribeiro, Leonardo A.
Martins, Wellington S.
ICEIS: PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS - VOL 1, 2017 : 152 - 161
[4] Machine learning using synthetic and real data: Similarity of evaluation metrics for different healthcare datasets and for different algorithms
Heyburn, Rachel
Bond, Raymond R.
Black, Michaela
Mulvenna, Maurice
Wallace, Jonathan
Rankin, Deborah
Cleland, Brian
DATA SCIENCE AND KNOWLEDGE ENGINEERING FOR SENSING DECISION SUPPORT, 2018, 11 : 1281 - 1291
[5] A comparative evaluation of string similarity metrics for ontology alignment
Sun, Yufei
Ma, Liangli
Wang, Shuang
Journal of Information and Computational Science, 2015, 12 (03) : 957 - 964
[6] Deep Learning Approaches for Similarity Computation: A Survey
Yang, Peilun
Wang, Hanchen
Yang, Jianye
Qian, Zhengping
Zhang, Ying
Lin, Xuemin
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (12) : 7893 - 7912
[7] Machine Learning Metrics for Network Datasets Evaluation
Soukup, Dominik
Uhricek, Daniel
Vasata, Daniel
Cejka, Tomas
ICT SYSTEMS SECURITY AND PRIVACY PROTECTION, IFIP SEC 2023, 2024, 679 : 307 - 320
[8] Learning to combine multiple string similarity metrics for effective toponym matching
Santos, Rui
Murrieta-Flores, Patricia
Martins, Bruno
INTERNATIONAL JOURNAL OF DIGITAL EARTH, 2018, 11 (09) : 913 - 938
[9] Arabic sentence similarity based on similarity features and machine learning
Alian, Marwah
Awajan, Arafa
SOFT COMPUTING, 2021, 25 (15) : 10089 - 10101
