Comparative Study of Machine Learning Techniques for Genome Scale Discrimination of Recombinant HIV-1 Strains

Cited by: 4
Authors
Dwivedi, Ashok Kumar [1]
Chouhan, Usha [1]
Affiliations
[1] Maulana Azad Natl Inst Technol, Dept Bioinformat Math & Comp Applicat, Bhopal 462003, Madhya Pradesh, India
Keywords
HIV-1; Machine Learning; Classification; Recombinant; Non Recombinant; NEURAL-NETWORKS; KNOWLEDGE;
DOI
10.1166/jmihi.2016.1699
Chinese Library Classification
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Whole genomes of HIV-1 strains were analyzed to discriminate circulating recombinant forms from non-recombinant genomes using naive Bayes, logistic regression, support vector machine, k-nearest neighbor, and classification tree classifiers, with codon frequencies as sequence attributes. The performance of the five techniques was compared on classification accuracy, sensitivity, specificity, Matthews correlation coefficient, and Brier score. The techniques were also compared using receiver operating characteristic (ROC) curves and calibration graphs. All techniques were validated using tenfold cross-validation and evaluated on a training data set of 4215 genomes, comprising 3004 non-recombinant strains and 1211 circulating recombinant strains. The highest classification accuracy, 94.47%, was achieved using k-nearest neighbor under tenfold cross-validation. Classification accuracies of 84.49%, 88.28%, 92.22%, and 86.31% were achieved using naive Bayes, logistic regression, support vector machine, and classification trees, respectively. k-nearest neighbor also performed best on the ROC curve, with an area under the curve of 0.9754. Our results indicate that supervised machine learning techniques can be effectively applied to discriminate recombinant from non-recombinant HIV-1 strains at genome scale using codon frequencies.
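The record does not say how the codon frequencies or the evaluation indices were computed, so the following is only a minimal sketch under stated assumptions: codons are read in a single frame starting at position 0, codons containing ambiguous bases are skipped, and predicted probabilities are thresholded at 0.5. The function names `codon_frequencies` and `binary_metrics` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import sqrt

# The 64 unambiguous codons in a fixed order (AAA, AAC, ..., TTT).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(sequence, frame=0):
    """Return the relative frequency of each codon as a 64-element list.

    Assumption: one reading frame, non-overlapping triplets; triplets
    containing characters other than A, C, G, T are ignored.
    """
    seq = sequence.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute the indices named in the abstract for a binary classifier.

    y_true holds 0/1 labels (1 = recombinant, by assumption) and y_prob
    holds the predicted probability of class 1.
    """
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    n = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Matthews correlation coefficient; defined as 0 when undefined.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        # Brier score: mean squared error of the predicted probabilities.
        "brier": sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / n,
    }
```

Each genome would then be represented by one 64-element frequency vector, which any of the five classifiers named in the abstract could take as input.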
Pages: 425-430
Number of pages: 6