Improving identification of difficult small classes by balancing class distribution

Cited by: 450
Authors
Laurikkala, J [1]
Affiliations
[1] Tampere Univ, Dept Informat & Comp Sci, FIN-33014 Tampere, Finland
Source
ARTIFICIAL INTELLIGENCE IN MEDICINE, PROCEEDINGS | 2001 / Vol. 2101
Keywords
DOI
10.1007/3-540-48229-6_9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We studied three methods to improve identification of difficult small classes by balancing imbalanced class distribution with data reduction. The new method, neighborhood cleaning rule (NCL), outperformed simple random and one-sided selection methods in experiments with ten data sets. All reduction methods improved identification of small classes (20-30%), but the differences were insignificant. However, significant differences in accuracies, true-positive rates and true-negative rates obtained with the 3-nearest neighbor method and C4.5 from the reduced data favored NCL. The results suggest that NCL is a useful method for improving the modeling of difficult small classes, and for building classifiers to identify these classes from real-world data.
Pages: 63-66
Page count: 4
Related papers
9 in total
[1] Aha, DW; Kibler, D; Albert, MK. Instance-based learning algorithms [J]. Machine Learning, 1991, 6 (01): 37-66
[2] [Anonymous], 2001, IMPROVING IDENTIFICA
[3] Blake C.L., 1998, UCI repository of machine learning databases
[4] Cochran W.G., 1953, A Wiley publication in applied statistics
[5] Kentala E, 1996, AM J OTOL, V17, P883
[6] Kubat M, 1997, P 14 INT C MACH LEAR, P821
[7] Laurikkala J, 2001, COMPUT BIOL MED, V31
[8] Quinlan R, 1993, C4.5: Programs for Machine Learning
[9] Wilson, DR; Martinez, TR. Reduction techniques for instance-based learning algorithms [J]. Machine Learning, 2000, 38 (03): 257-286