Classifiers and their Metrics Quantified

Times cited: 47
Author
Brown, J. B. [1]
Affiliation
[1] Kyoto Univ, Grad Sch Med, Lab Mol Biosci, Sakyo Ku, E-109 Konoemachi, Kyoto 6068501, Japan
Keywords
Classifiers; metrics; prediction; modeling; performance assessment
DOI
10.1002/minf.201700127
Chinese Library Classification number
R914 [Medicinal chemistry]
Subject classification code
100701
Abstract
Molecular modeling frequently constructs classification models for the prediction of two-class entities, such as compound bio(in)activity, chemical property (non)existence, protein (non)interaction, and so forth. The models are evaluated using well-known metrics such as accuracy or true positive rates. However, these frequently used metrics, when applied to retrospective and/or artificially generated prediction datasets, can overestimate true performance in actual prospective experiments. Here, we systematically consider metric value surface generation as a consequence of data balance, and propose the computation of an inverse cumulative distribution function taken over a metric surface. The proposed distribution analysis can aid in the selection of metrics when formulating study design. In addition to theoretical analyses, a practical example in chemogenomic virtual screening highlights the care required in metric selection and interpretation.
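The idea of a metric surface and a distribution taken over it can be illustrated with a minimal sketch. This is not the paper's implementation; it only assumes that a "surface" means the metric evaluated at every possible confusion matrix for a fixed number of positives and negatives, and that the distribution of interest is the fraction of that surface at or above a threshold. The function names (`metric_surface`, `fraction_at_least`) and the example class sizes are hypothetical.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined here as 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def metric_surface(metric, n_pos, n_neg):
    """Evaluate a metric at every (TP, TN) pair possible for the given class sizes.

    Assumption: the surface is indexed by TP in 0..n_pos and TN in 0..n_neg,
    with FN = n_pos - TP and FP = n_neg - TN.
    """
    return [
        [metric(tp, tn, n_neg - tn, n_pos - tp) for tn in range(n_neg + 1)]
        for tp in range(n_pos + 1)
    ]


def fraction_at_least(surface, threshold):
    """Fraction of surface points whose metric value is >= threshold.

    Assumption: every point of the surface is weighted equally.
    """
    values = [v for row in surface for v in row]
    return sum(v >= threshold for v in values) / len(values)


# Hypothetical example: 100 samples at two different class balances.
imbalanced_acc = metric_surface(accuracy, n_pos=10, n_neg=90)
balanced_acc = metric_surface(accuracy, n_pos=50, n_neg=50)
imbalanced_mcc = metric_surface(mcc, n_pos=10, n_neg=90)

for name, surface in [
    ("accuracy, 10:90", imbalanced_acc),
    ("accuracy, 50:50", balanced_acc),
    ("MCC, 10:90", imbalanced_mcc),
]:
    print(name, fraction_at_least(surface, 0.9))
```

Comparing the printed fractions for different metrics and class balances gives a rough sense of how easy or hard it is to reach a given metric value, which is the kind of consideration the abstract suggests should inform metric selection.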
Pages: 11