Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem

Cited by: 57
Authors
Lu, Yang [1]
Cheung, Yiu-Ming [2]
Tang, Yuan Yan [3, 4]
Affiliations
[1] Xiamen Univ, Sch Informat, Key Lab Sensing & Comp Smart City, Xiamen 361005, Peoples R China
[2] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China
[3] City Univ, Community Coll, UOW Coll Hong Kong, Fac Sci & Technol, Hong Kong, Peoples R China
[4] Univ Macau, Fac Sci & Technol, Dept Comp & Informat Sci, Macau, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Indexes; Complexity theory; Learning systems; Training; Benchmark testing; Technological innovation; Computer science; Bayes classifier; class imbalance learning; data complexity; imbalance measure; imbalance recovery methods; SAMPLING METHOD; SMOTE; COMPLEXITY; NOISY;
DOI
10.1109/TNNLS.2019.2944962
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Recent studies of imbalanced data classification have shown that the imbalance ratio (IR) is not the only cause of performance loss in a classifier, as other data factors, such as small disjuncts, noise, and overlapping, can also make the problem difficult. The relationship between the IR and other data factors has been demonstrated, but to the best of our knowledge, there is no measurement of the extent to which class imbalance influences the classification performance of imbalanced data. In addition, it is also unknown which data factor serves as the main barrier for classification in a data set. In this article, we focus on the Bayes optimal classifier and examine the influence of class imbalance from a theoretical perspective. We propose an instance measure called the Individual Bayes Imbalance Impact Index (IBI3) and a data measure called the Bayes Imbalance Impact Index (BI3). IBI3 and BI3 reflect the extent of influence using only the imbalance factor, in terms of each minority class sample and the whole data set, respectively. Therefore, IBI3 can be used as an instance complexity measure of imbalance and BI3 as a criterion to demonstrate the degree to which imbalance deteriorates the classification of a data set. We can, therefore, use BI3 to assess whether it is worth using imbalance recovery methods, such as sampling or cost-sensitive methods, to recover the performance loss of a classifier. The experiments show that IBI3 is highly consistent with the increase of the prediction score obtained by the imbalance recovery methods and that BI3 is highly consistent with the improvement in the F1 score obtained by the imbalance recovery methods on both synthetic and real benchmark data sets.
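The abstract does not state the formulas for IBI3 or BI3, so the following is only a rough illustration of the underlying idea, not the authors' definition. It assumes two one-dimensional Gaussian class-conditional densities that are fully known (an assumption made here for illustration) and compares the Bayes posterior of the minority class under the actual class prior with the posterior the same point would receive under equal priors. The function names `gaussian_pdf`, `minority_posterior`, and `imbalance_gap` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def minority_posterior(x, prior_min, min_params, maj_params):
    """Bayes posterior P(minority | x) for a given minority prior.

    min_params and maj_params are (mean, std) tuples; both are
    assumed known here, which is rarely true for real data.
    """
    f_min = gaussian_pdf(x, *min_params)
    f_maj = gaussian_pdf(x, *maj_params)
    weighted_min = prior_min * f_min
    return weighted_min / (weighted_min + (1.0 - prior_min) * f_maj)


def imbalance_gap(x, prior_min, min_params, maj_params):
    """Posterior under equal priors minus posterior under the actual prior.

    Illustrative quantity only: it isolates the effect of the class
    prior while holding the class-conditional densities fixed.
    """
    balanced = minority_posterior(x, 0.5, min_params, maj_params)
    actual = minority_posterior(x, prior_min, min_params, maj_params)
    return balanced - actual


if __name__ == "__main__":
    # Hypothetical setup: minority ~ N(2, 1), majority ~ N(0, 1), 10% minority.
    for x in (0.0, 1.0, 2.0, 3.0):
        print(x, imbalance_gap(x, 0.1, (2.0, 1.0), (0.0, 1.0)))
```

In this toy setting the gap is zero when the prior is already balanced and grows for points in the overlap region, which matches the abstract's intuition that the imbalance factor affects individual minority samples to different degrees.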
Pages: 3525-3539
Page count: 15
Related papers
40 in total
[1]  
Alcalá-Fdez J, 2011, J MULT-VALUED LOG S, V17, P255
[2]   Measurement of Data Complexity for Classification Problems with Unbalanced Data [J].
Anwar, Nafees ;
Jones, Geoff ;
Ganesh, Siva .
STATISTICAL ANALYSIS AND DATA MINING, 2014, 7 (03) :194-211
[3]   Behavioral Analysis of Insider Threat: A Survey and Bootstrapped Prediction in Imbalanced Data [J].
Azaria, Amos ;
Richardson, Ariella ;
Kraus, Sarit ;
Subrahmanian, V. S. .
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2014, 1 (02) :135-155
[4]  
Batista G. E. A. P. A., 2004, ACM SIGKDD Explor Newsl, V6, P20, DOI [10.1145/1007730.1007735, DOI 10.1145/1007730.1007735]
[5]   A Survey of Predictive Modeling on Imbalanced Domains [J].
Branco, Paula ;
Torgo, Luis ;
Ribeiro, Rita P. .
ACM COMPUTING SURVEYS, 2016, 49 (02)
[6]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[7]  
Breiman L., 1984, Classification and Regression Trees, DOI DOI 10.1201/9781315139470
[8]   SMOTE: Synthetic minority over-sampling technique [J].
Chawla, Nitesh V. ;
Bowyer, Kevin W. ;
Hall, Lawrence O. ;
Kegelmeyer, W. Philip .
2002, American Association for Artificial Intelligence (16)
[9]   NEAREST NEIGHBOR PATTERN CLASSIFICATION [J].
COVER, TM ;
HART, PE .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1967, 13 (01) :21-+
[10]   Tuning Support Vector Machines for Minimax and Neyman-Pearson Classification [J].
Davenport, Mark A. ;
Baraniuk, Richard G. ;
Scott, Clayton D. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2010, 32 (10) :1888-1898