Box Drawings for Learning with Imbalanced Data

Cited by: 20
Authors
Goh, Siong Thye [1]
Rudin, Cynthia [2, 3]
Affiliations
[1] MIT, Operat Res Ctr, Cambridge, MA 02139 USA
[2] MIT, CSAIL, Cambridge, MA 02142 USA
[3] MIT, Sloan Sch Management, Cambridge, MA 02142 USA
Source
PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'14) | 2014
Keywords
Classification; Imbalanced Data; Decision Trees; CLASSIFICATION; ALGORITHMS; ROBUST;
DOI
10.1145/2623330.2623648
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
The vast majority of real-world classification problems are imbalanced, meaning there are far fewer data from the class of interest (the positive class) than from other classes. We propose two machine learning algorithms to handle highly imbalanced classification problems. The classifiers are disjunctions of conjunctions, created as unions of axis-parallel rectangles around the positive examples, and thus have the benefit of being interpretable. The first algorithm uses mixed integer programming to optimize a weighted balance between positive and negative class accuracies. Regularization is introduced to improve generalization performance. The second method uses an approximation to improve scalability. Specifically, it follows a characterize-then-discriminate approach, where the positive class is first characterized by boxes, and then each box boundary becomes a separate discriminative classifier. This method has the computational advantages that it can be easily parallelized and considers only the relevant regions of feature space.
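As an illustration of the classifier form the abstract describes, the following is a minimal sketch, not the authors' implementation. It assumes boxes are given as (lower, upper) corner pairs and that the "weighted balance" is a weighted sum of per-class accuracies; the function names, the weights `c_pos` and `c_neg`, and the exact objective are hypothetical, since the abstract does not specify them. Regularization and the mixed integer program that would choose the boxes are omitted.

```python
import numpy as np


def predict_union_of_boxes(X, boxes):
    """Label a point +1 if it lies inside at least one axis-parallel box, else -1.

    Each box is a (lower, upper) pair of per-feature bounds, so a single box is a
    conjunction of interval tests and the union of boxes is a disjunction of them.
    """
    X = np.asarray(X, dtype=float)
    inside_any = np.zeros(X.shape[0], dtype=bool)
    for lower, upper in boxes:
        lower = np.asarray(lower, dtype=float)
        upper = np.asarray(upper, dtype=float)
        # A point is inside the box only if every feature is within its bounds.
        inside = np.all((X >= lower) & (X <= upper), axis=1)
        inside_any |= inside
    return np.where(inside_any, 1, -1)


def weighted_class_accuracy(y_true, y_pred, c_pos=1.0, c_neg=1.0):
    """Weighted sum of positive-class and negative-class accuracy.

    Assumed form of the objective (hypothetical); both classes must be present.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == 1
    neg = ~pos
    pos_acc = np.mean(y_pred[pos] == 1)
    neg_acc = np.mean(y_pred[neg] == -1)
    return c_pos * pos_acc + c_neg * neg_acc


# Example: two boxes in a two-dimensional feature space.
example_boxes = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [4.0, 4.0])]
example_X = [[0.5, 0.5], [2.0, 2.0], [3.5, 3.5], [5.0, 5.0]]
example_pred = predict_union_of_boxes(example_X, example_boxes)
```

Raising `c_pos` relative to `c_neg` in this sketch makes a missed positive example cost more than a false alarm, which is one way to express the emphasis on the rare class.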
Pages: 333-342
Page count: 10