Parameter-free imputation for imbalance datasets

Cited by: 1

Authors:

Theoretical and Empirical Research Group, Department of Computer Science, Chiang Mai University, Chiang Mai 50200, Thailand [1]

Affiliations:

[1] Theoretical and Empirical Research Group, Department of Computer Science, Chiang Mai University, Chiang Mai

Source:

Takum, Jintana | Year 1600 / Vol. Springer Verlag / Issue 8839

Keywords:

Class imbalance; Classification; Imputation; K-nearest neighbours; Parameter-free

DOI:

10.1007/978-3-319-12823-8_27

Abstract:

Class imbalance research aims to improve accuracy on a minority class, while imputation is the process of replacing missing values. Traditionally, the class imbalance and imputation problems are considered independently. In addition, minority-class values filled in by traditional methods are not sufficient for imbalanced datasets. In this paper, we provide a new parameter-free imputation method for imbalanced datasets that estimates a random value between the mean of the attribute containing the missing value and the value of that attribute in the record closest to the record with the missing value. Our proposed algorithm ignores the mean of instances to avoid an over-fitting problem. Consequently, experimental results on imbalanced datasets reveal that our imputation outperforms other techniques when class imbalance measures are used. © Springer International Publishing Switzerland 2014.
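
The imputation rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `impute_sketch`, the Euclidean distance over shared observed attributes, the uniform draw, and the `seed` argument are all assumptions made for illustration, since the abstract does not specify them. The sketch also assumes every attribute has at least one observed value and that numeric attributes are used throughout.

```python
import math
import random


def impute_sketch(rows, seed=0):
    """Fill None entries with a random value between the attribute mean
    and the nearest record's value for that attribute (illustrative only)."""
    rng = random.Random(seed)  # seed is a hypothetical convenience for repeatability
    n_attrs = len(rows[0])

    # Attribute means computed over observed (non-missing) values only.
    means = []
    for j in range(n_attrs):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))

    def distance(a, b):
        # Assumed metric: Euclidean distance over attributes present in both records.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    filled = [list(r) for r in rows]  # copy so the input is left unchanged
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have attribute j observed.
            candidates = [r for k, r in enumerate(rows) if k != i and r[j] is not None]
            nearest = min(candidates, key=lambda r: distance(row, r))
            # Draw uniformly between the attribute mean and the neighbour's value.
            low, high = sorted((means[j], nearest[j]))
            filled[i][j] = rng.uniform(low, high)
    return filled


if __name__ == "__main__":
    data = [[1.0, 2.0], [1.1, None], [5.0, 10.0]]
    print(impute_sketch(data, seed=42))
```

In the small example, the mean of the second attribute is 6.0 and the closest record to `[1.1, None]` is `[1.0, 2.0]`, so the filled-in value falls between 2.0 and 6.0. No tuning parameter such as k is required, which is the sense in which the sketch is parameter-free.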

Citation

Pages: 260-267

Page count: 7

14 references in total

[1] Gelman A., Hill J., Data Analysis Using Regression and Multi-level/Hierarchical Models, Missing-data Imputation, pp. 529-544, (2006)
[2] Batista G., Monard M.C., A study of K-nearest neighbour as an imputation method, Hybrid Intell. Syst., Ser. Front Artif. Intell. Appl, 87, pp. 251-260, (2002)
[3] Batista G., Monard M.C., Experimental comparison of K-nearest neighbour and mean or mode imputation methods with the internal strategies used by C4.5 and CN2 to treat missing data. Tech. Rep, (2003)
[4] Blake C.L., Merz C.J., UCI Repository of Machine Learning Databases. Department of Information and Computer Sciences, (2009)
[5] Bradley A.P., The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms, Pattern Recognition, 30, 6, pp. 1145-1159, (1997)
[6] Buckland M., Gey F., The Relationship between Recall and Precision, Journal of the American Society for Information Science, 45, 1, pp. 12-19, (1994)
[7] Bunkhumpornpat C., Subpaiboonkit S., Safe Level Graph for Synthetic Minority Oversampling Techniques, The 13th International Symposium on Communications and Information Technologies (ISCIT) indexed in IEEE Xplore, pp. 570-575, (2013)
[8] Zhu H., Lee S.-Y., Wei B.-C., Zhou J., Case-deletion measures for models with incomplete data, Biometrika, pp. 727-737, (2001)
[9] Japkowicz N., Class imbalance Problem: Significance and Strategies, The 2000 International Conference on Artificial Intelligence (IC-AI 2000), pp. 111-117, (2000)
[10] Hall M.A., Frank E., Witten I.H., Data Mining: Practical Machine Learning Tools and Techniques, The Morgan Kaufmann Series in Data Management Systems, (2011)
