GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning

Cited by: 97
Authors
Esposito, Carmen [1]
Landrum, Gregory A. [1,2]
Schneider, Nadine [3]
Stiefl, Nikolaus [3]
Riniker, Sereina [1]
Affiliations
[1] Swiss Fed Inst Technol, Lab Phys Chem, CH-8093 Zurich, Switzerland
[2] T5 Informat GmbH, CH-4055 Basel, Switzerland
[3] Novartis Pharma AG, Novartis Inst BioMed Res, Novartis Campus, CH-4002 Basel, Switzerland
Keywords
BINARY CLASSIFICATION; CONFORMAL PREDICTION; VALIDATION; AGREEMENT; ZINC;
DOI
10.1021/acs.jcim.1c00160
Chinese Library Classification number
R914 [Medicinal Chemistry];
Discipline classification code
100701
Abstract
Machine learning classifiers trained on class-imbalanced data are prone to overpredicting the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set to 0.5 by default, which is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific to random forest (RF), while the second approach, named GHOST, can potentially be applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
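The general idea of shifting the decision threshold without retraining can be illustrated with a minimal sketch. This is not the authors' GHOST procedure, whose details are not given in this record; it only scans candidate thresholds over already-computed class probabilities and keeps the one with the best agreement score. The use of Cohen's kappa as the score, the helper names cohen_kappa and pick_threshold, the threshold grid, and the toy labels and probabilities are all assumptions made for illustration.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two binary (0/1) label sequences."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = n - tp - tn - fp
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal label frequencies.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present in both sequences
    return (observed - expected) / (1.0 - expected)


def pick_threshold(y_true, probs, thresholds):
    """Return (threshold, kappa) for the candidate threshold with the highest kappa.

    Hypothetical helper: the model is never retrained, only its
    predicted minority-class probabilities are re-thresholded.
    """
    best_t, best_k = None, float("-inf")
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        k = cohen_kappa(y_true, preds)
        if k > best_k:  # ties keep the lowest threshold
            best_t, best_k = t, k
    return best_t, best_k


# Toy imbalanced data (made up): 4 actives, 16 inactives.
y_true = [1, 1, 1, 1] + [0] * 16
probs = [0.55, 0.45, 0.35, 0.30] + [0.05, 0.10, 0.15, 0.20] * 3 + [0.05, 0.10, 0.15, 0.40]

thresholds = [i / 20 for i in range(1, 11)]  # 0.05, 0.10, ..., 0.50
best_t, best_k = pick_threshold(y_true, probs, thresholds)

# Kappa at the default threshold of 0.5, for comparison.
default_k = cohen_kappa(y_true, [1 if p >= 0.5 else 0 for p in probs])

print(f"default threshold 0.50: kappa = {default_k:.3f}")
print(f"selected threshold {best_t:.2f}: kappa = {best_k:.3f}")
```

On this toy data the default threshold recovers only one of the four minority-class examples, while a lower threshold recovers all four at the cost of a single false positive. In practice the probabilities would come from a trained classifier, and the threshold should be chosen on training or validation predictions rather than on the test set.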
Pages: 2623-2640
Page count: 18