Optimizing classification efficiency with machine learning techniques for pattern matching

Cited by: 0

Authors:

Belal A. Hamed

Osman Ali Sadek Ibrahim

Tarek Abd El-Hafeez

Affiliations:

[1] Minia University, Department of Computer Science, Faculty of Science

[2] Deraya University, Computer Science Unit

Source:

Journal of Big Data, Vol. 10

Keywords:

Bioinformatics; Feature extraction; Pattern matching; Machine learning; DNA sequences

DOI:

Not available

Abstract:

The study proposes a novel model for DNA sequence classification that combines machine learning methods with a pattern-matching algorithm. The model aims to categorize DNA sequences by their features and to improve both the accuracy and the efficiency of classification. Its performance is evaluated with several machine learning algorithms. SVM Linear achieves the highest accuracy (0.963) and F1 score (0.97) among the tested algorithms, indicating the best overall performance in DNA sequence classification. Naive Bayes also performs well, with an accuracy of 0.838 and an F1 score of 0.94. The proposed model is also compared with two suggested algorithms, FLPM and PAPM, and outperforms both in accuracy and efficiency. The study further examines how pattern length affects the accuracy and time complexity of each algorithm. Execution time varies with pattern length: at a pattern length of 5, SVM Linear and EFLPM have the lowest execution time of 0.0035 s, whereas at a pattern length of 25, SVM Linear has the lowest execution time of 0.0012 s. The proposed model contributes to DNA sequence analysis by providing a novel approach to pre-processing and feature extraction, with potential applications in drug discovery, personalized medicine, and disease diagnosis. The findings highlight the importance of considering pattern length when assessing the accuracy and time complexity of DNA sequence classification algorithms.
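The abstract does not spell out the feature extraction or the evaluation procedure. As a rough illustration only, the sketch below assumes that features are counts of overlapping k-mers (short fixed-length patterns) over the alphabet ACGT, and shows how accuracy and F1 score can be computed for a binary labelling. The function names `kmer_features` and `accuracy_and_f1`, the value of `k`, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def kmer_features(seq, k=3, alphabet="ACGT"):
    """Return counts of overlapping k-mers in a fixed lexicographic order.

    Assumption: k-mer counting stands in for the paper's (unspecified)
    pattern-based feature extraction.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Toy sequence and labels, for illustration only.
    print(kmer_features("ACGTAC", k=2))
    print(accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

Feature vectors of this kind could then be passed to any classifier, such as a linear SVM; the abstract reports results for SVM Linear and Naive Bayes but does not state which library or feature encoding was used.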
