Addressing Class Imbalance in Non-Binary Classification Problems

Cited by: 2
Authors
Seliya, Naeem [1]
Xu, Zhiwei [1]
Khoshgoftaar, Taghi M. [2]
Affiliations
[1] Univ Michigan, Comp Informat Sci, 4901 Evergreen Rd, Dearborn, MI 48128 USA
[2] Florida Atlantic Univ, Comp Sci Engn, Boca Raton, FL 33431 USA
Source
20TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, VOL 1, PROCEEDINGS | 2008
Keywords
Machine learning; class imbalance; non-binary classifiers; data sampling; artificial intelligence
DOI
10.1109/ICTAI.2008.120
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Class imbalance in machine learning is a real and cumbersome obstacle to building a useful, practical classification model. We present a unique insight into addressing class imbalance in classification problems that involve three or more categories, i.e., non-binary problems. This study differs from related work in the literature, most of which addresses class imbalance only for binary classification problems, even when that means transforming a non-binary dataset into a binary one. We propose an effective yet simple approach to alleviating class imbalance when the classification problem involves more than two classes. The process, which comprises four different methods, applies random undersampling and random oversampling to different parts of the dataset to achieve better classification performance. The proposed data sampling methods are evaluated on two real-world datasets obtained from the UCI Repository of Machine Learning Databases, using two commonly used classification algorithms: C4.5 and RIPPER. Our results demonstrate that multi-group classification accuracy increases significantly in most cases after the proposed data sampling methods are applied. This positive outcome motivates further research on class imbalance in non-binary classification problems.
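The abstract does not spell out the four sampling methods, so the following is only a minimal, illustrative sketch of the general idea it names: random undersampling applied to the larger classes and random oversampling applied to the smaller classes of a multi-class dataset. The function name `resample_classes`, the `target_size` parameter, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def resample_classes(samples, labels, target_size, seed=0):
    """Bring every class to target_size examples (hypothetical helper).

    Classes larger than target_size are randomly undersampled;
    all other classes are randomly oversampled by duplication.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    new_samples, new_labels = [], []
    for label, members in by_class.items():
        if len(members) > target_size:
            # Random undersampling: keep a subset drawn without replacement.
            chosen = rng.sample(members, target_size)
        else:
            # Random oversampling: keep all originals and add duplicates
            # drawn with replacement until the class reaches target_size.
            shortfall = target_size - len(members)
            chosen = members + [rng.choice(members) for _ in range(shortfall)]
        new_samples.extend(chosen)
        new_labels.extend([label] * len(chosen))
    return new_samples, new_labels


# Toy three-class dataset with a 50 / 10 / 4 class distribution (made up).
labels = ["a"] * 50 + ["b"] * 10 + ["c"] * 4
samples = list(range(len(labels)))

balanced_samples, balanced_labels = resample_classes(samples, labels, target_size=10)
print(Counter(balanced_labels))  # each class now has 10 examples
```

Balancing every class to a single size is just one possible choice; the phrase "different parts of the dataset" suggests that the paper's methods vary which classes are undersampled and which are oversampled. In practice, resampling would be applied to the training data only, so that evaluation reflects the original class distribution.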
Pages: 460 / +
Page count: 2