The effect of imbalanced data sets on LDA: A theoretical and empirical analysis

Cited by: 69
Authors
Xie, Jigang [1]
Qiu, Zhengding [1]
Affiliation
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
Keywords
imbalanced data sets; linear discriminant analysis (LDA); random sampling; Tomek links; SMOTE
DOI
10.1016/j.patcog.2006.01.009
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper demonstrates theoretically that imbalanced data sets have a negative effect on the performance of LDA. The experimental results confirm this analysis: when several sampling methods are used to rebalance the imbalanced data sets, LDA performs better on the balanced data sets than on the imbalanced ones. © 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
Pages: 557-562
Page count: 6