The influence of negative training set size on machine learning-based virtual screening

Times cited: 64
Authors
Kurczab, Rafal [1]
Smusz, Sabina [1,2]
Bojarski, Andrzej J. [1]
Affiliations
[1] Polish Acad Sci, Inst Pharmacol, Dept Med Chem, PL-31343 Krakow, Poland
[2] Jagiellonian Univ, Fac Chem, PL-30060 Krakow, Poland
Keywords
CLASSIFICATION; PERFORMANCE; CHEMISTRY; TOOL
DOI
10.1186/1758-2946-6-32
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Background: This paper presents a thorough analysis of how the number of negative training examples influences the performance of machine learning methods. Results: The impact of this rather neglected aspect of applying machine learning methods was examined for training sets containing a fixed number of positive examples and a varying number of negative examples randomly selected from the ZINC database. Increasing the number of negative training instances relative to the fixed number of positives was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In the majority of cases, substantial increases in precision and the Matthews correlation coefficient (MCC) were observed, accompanied by some decrease in hit recall. Analysis of the dynamics of these variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naive Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naive Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set. Conclusions: The ratio of positive to negative training instances should be taken into account when preparing machine learning experiments, as it might significantly influence the performance of a particular classifier. Moreover, optimizing the negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
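The evaluation measures named in the abstract (precision, hit recall and MCC) and the idea of pairing a fixed positive set with a varying number of randomly drawn negatives can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `screening_metrics` and `build_training_set`, the `ratio` parameter and the placeholder compound identifiers are assumptions introduced here, and the record itself gives no specific ratios or counts.

```python
import math
import random


def screening_metrics(tp, fp, tn, fn):
    """Return (precision, recall, mcc) from confusion-matrix counts.

    Each measure falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


def build_training_set(positives, negative_pool, ratio, seed=0):
    """Keep every positive and draw ratio * len(positives) random negatives.

    `ratio` (negatives per positive) is a hypothetical knob for this sketch.
    Returns a list of (compound, label) pairs with label 1 for actives.
    """
    rng = random.Random(seed)
    n_negatives = ratio * len(positives)
    negatives = rng.sample(negative_pool, n_negatives)
    return [(c, 1) for c in positives] + [(c, 0) for c in negatives]


if __name__ == "__main__":
    # Placeholder identifiers standing in for actives and ZINC-like decoys.
    actives = [f"active_{i}" for i in range(5)]
    decoy_pool = [f"decoy_{i}" for i in range(100)]

    for ratio in (1, 3, 10):
        train = build_training_set(actives, decoy_pool, ratio)
        n_pos = sum(label for _, label in train)
        print(ratio, len(train), n_pos)

    # Illustrative confusion counts only, not results from the study.
    print(screening_metrics(tp=80, fp=20, tn=880, fn=20))
```

In an actual experiment, a classifier would be trained on each set returned by `build_training_set` and its predictions on a held-out screening library would supply the confusion counts passed to `screening_metrics`.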
Pages: 9