Splicing sites prediction of human genome using machine learning techniques

Cited by: 0
Authors
Waseem Ullah
Khan Muhammad
Ijaz Ul Haq
Amin Ullah
Saeed Ullah Khattak
Muhammad Sajjad
Affiliations
[1] Sejong University, Intelligent Media Laboratory
[2] Sejong University, Visual Analytics for Knowledge Laboratory, Department of Software
[3] University of Peshawar, Centre of Biotechnology and Microbiology
[4] Islamia College Peshawar, Department of Computer Science
Source
Multimedia Tools and Applications | 2021 / Vol. 80
Keywords
Biomedical data; Big data analysis; Computer-aided diagnosis; Genomics; Machine learning; Pattern recognition; Splicing sites
DOI
Not available
Abstract
Accurate splice site prediction has several applications in medical science and biochemistry. For instance, mutations affecting splice sites can lead to genetic diseases and cancers such as Lynch syndrome and breast cancer. Collecting ribonucleic acid (RNA) samples is an efficient and convenient way to detect the involvement of splicing defects in disease formation. The present study therefore aims to develop an accurate and robust Computer-Aided Diagnosis (CAD) method for swift and precise targeting of splice site sequences. A composite feature-based model is proposed that converts DNA sequences into numerical descriptors and integrates three sample representation methods, namely Dinucleotide Composition (DNC), Trinucleotide Composition (TNC) and Tetranucleotide Composition (TetraNC). The precision and accuracy of these features are analyzed with machine learning algorithms such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Naïve Bayes (NB). Results show that the proposed composite feature vector with an SVM classifier achieved accuracies of 95.20% and 97.50% on the donor and acceptor site datasets, respectively.
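The composite feature vector described in the abstract can be illustrated with a minimal sketch. It assumes that DNC, TNC and TetraNC are the normalized frequencies of all overlapping 2-, 3- and 4-mers over the alphabet A, C, G, T, concatenated into one vector of 16 + 64 + 256 = 336 values. The function names, the lexicographic k-mer ordering, and the choice to skip windows containing ambiguous bases are assumptions made for illustration; the paper's exact definitions may differ.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_composition(seq, k):
    """Return normalized frequencies of all 4**k k-mers in lexicographic order.

    Hypothetical helper: overlapping windows are counted, and windows that
    contain a character outside ACGT (for example N) are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        # Sequence too short or fully ambiguous: return an all-zero vector.
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def composite_features(seq):
    """Concatenate DNC (k=2), TNC (k=3) and TetraNC (k=4) descriptors."""
    features = []
    for k in (2, 3, 4):
        features.extend(kmer_composition(seq, k))
    return features


if __name__ == "__main__":
    # Made-up example sequence, not taken from the paper's datasets.
    vec = composite_features("ACGTACGTAGGTAAGT")
    print(len(vec))
```

A vector of this kind could then be passed to any standard classifier (SVM, KNN or Naïve Bayes); the classifier settings used in the study are not given in this record.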
Pages: 30439-30460
Number of pages: 21