Precision-recall curve (PRC) classification trees

被引:92
作者
Miao, Jiaju [1 ]
Zhu, Wei [1 ]
机构
[1] SUNY Stony Brook, Dept Appl Math & Stat, Stony Brook, NY 11794 USA
关键词
Precision– recall curve; Imbalanced data; PRC random forest; PRC– ROC random forest; IMBALANCED DATA;
D O I
10.1007/s12065-021-00565-2
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The classification of imbalanced data has presented a significant challenge for most well-known classification algorithms that were often designed for data with relatively balanced class distributions. Nevertheless skewed class distribution is a common feature in real world problems. It is especially prevalent in certain application domains with great need for machine learning and better predictive analysis such as disease diagnosis, fraud detection, bankruptcy prediction, and suspect identification. In this paper, we propose a novel tree-based algorithm based on the area under the precision-recall curve for variable selection in the classification context. Our algorithm, named as the "Precision-Recall curve classification tree", or simply the "PRC classification tree" modifies two crucial stages in tree building. The first stage is to maximize the area under the precision-recall curve in node variable selection. The second stage is to maximize the harmonic mean of recall and precision (F-measure) for threshold selection. We found the proposed PRC classification tree, and its subsequent extension, the PRC random forest, work well especially for class-imbalanced data sets. We have demonstrated that our methods outperform their classic counterparts, the usual CART and random forest for both synthetic and real data. Furthermore, the ROC classification tree proposed by our group previously, based on the area under the ROC curve, has shown good performance in imbalanced data. Their combination, the PRC-ROC tree, has also shows great promise in identifying the minority class.
引用
收藏
页码:1545 / 1569
页数:25
相关论文
共 40 条
[1]   Applying support vector machines to imbalanced datasets [J].
Akbani, R ;
Kwek, S ;
Japkowicz, N .
MACHINE LEARNING: ECML 2004, PROCEEDINGS, 2004, 3201 :39-50
[2]  
Alpaydin E., 2010, INTRO MACHINE LEARNI, V1191, P68485
[3]  
Boyd Kendrick, 2013, Machine Learning and Knowledge Discovery in Databases. European Conference, ECML PKDD 2013. Proceedings: LNCS 8190, P451, DOI 10.1007/978-3-642-40994-3_29
[4]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[5]   Neural network method for failure detection with skewed class distribution [J].
Carvajal, K ;
Chacón, M ;
Mery, D ;
Acuña, G .
INSIGHT, 2004, 46 (07) :399-402
[6]   SMOTE: Synthetic minority over-sampling technique [J].
Chawla, Nitesh V. ;
Bowyer, Kevin W. ;
Hall, Lawrence O. ;
Kegelmeyer, W. Philip .
2002, American Association for Artificial Intelligence (16)
[7]   SMOTEBoost: Improving prediction of the minority class in boosting [J].
Chawla, NV ;
Lazarevic, A ;
Hall, LO ;
Bowyer, KW .
KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2003, PROCEEDINGS, 2003, 2838 :107-119
[8]  
Chen C., 2004, University of California, Berkeley
[9]  
CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411
[10]  
Debray, 1992, P JOINT INT C S LOG, P654