Parameter-free imputation for imbalance datasets

Cited by: 1
Authors
Theoretical and Empirical Research Group, Department of Computer Science, Chiang Mai University, Chiang Mai 50200, Thailand [1]
Affiliations
[1] Theoretical and Empirical Research Group, Department of Computer Science, Chiang Mai University, Chiang Mai
Source
Takum, Jintana | Year 1600 / Springer Verlag / Issue 8839
Keywords
Class imbalance; Classification; Imputation; K-nearest neighbours; Parameter-free
DOI
10.1007/978-3-319-12823-8_27
Chinese Library Classification number
Subject classification number
Abstract
Class imbalance is a problem in which the goal is to improve classification accuracy on a minority class, while imputation is the process of replacing missing values. Traditionally, class imbalance and imputation are treated independently. In addition, the minority-class values filled in by traditional methods are not adequate for imbalanced datasets. In this paper, we propose a new parameter-free imputation method for imbalanced datasets that estimates a missing value as a random value between the mean of the attribute containing the missing value and the value of that attribute in the record closest to the record with the missing value. Our proposed algorithm ignores the mean of instances to avoid over-fitting. Experimental results on imbalanced datasets show that our imputation method outperforms other techniques when class-imbalance measures are used. © Springer International Publishing Switzerland 2014.
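The imputation rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `impute_between_mean_and_nearest`, the use of Euclidean distance over the attributes observed in both records, the uniform sampling, and the representation of missing values as `None` are all assumptions, since the abstract does not specify them.

```python
import math
import random


def impute_between_mean_and_nearest(records, rng=None):
    """Return a copy of `records` with each None replaced by a random value
    between the attribute mean and the nearest record's value.

    Assumptions (not stated in the abstract): numeric attributes, at least
    one observed value per attribute, and uniform sampling.
    """
    rng = rng or random.Random()
    n_attrs = len(records[0])

    # Mean of each attribute over its observed (non-missing) values.
    means = []
    for j in range(n_attrs):
        observed = [r[j] for r in records if r[j] is not None]
        means.append(sum(observed) / len(observed))

    def distance(a, b):
        # Assumed metric: Euclidean distance over attributes present in both.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    filled = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not None:
                continue
            # Candidate donors: other records where attribute j is observed.
            donors = [r for k, r in enumerate(records)
                      if k != i and r[j] is not None]
            nearest = min(donors, key=lambda r: distance(rec, r))
            low, high = sorted((means[j], nearest[j]))
            # Random value between the attribute mean and the nearest value.
            filled[i][j] = rng.uniform(low, high)
    return filled


if __name__ == "__main__":
    data = [[1.0, 2.0], [1.1, None], [5.0, 10.0]]
    print(impute_between_mean_and_nearest(data, random.Random(0)))
```

In this sketch there is no neighbourhood size or other tuning value to choose, which is one way to read "parameter-free"; whether the original method restricts the nearest record to the same class is not stated in the abstract and is left out here.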
Pages: 260-267
Number of pages: 7