Improving performance for classification with incomplete data using wrapper-based feature selection

Cited by: 19
Authors
Tran C.T. [1]
Zhang M. [1]
Andreae P. [1]
Xue B. [1]
Affiliations
[1] Evolutionary Computation Research Group, School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington, New Zealand
Keywords
C4.5; Classification; Feature selection; Incomplete data; Missing data; Missing values; Particle swarm optimisation
DOI
10.1007/s12065-016-0141-6
Abstract
Missing values are an unavoidable problem in many real-world datasets. Inadequate treatment of missing values can lead to large classification errors, so handling them well is essential for classification. Feature selection is well known to improve classification, but it has seldom been used to improve classification with incomplete datasets. Moreover, some classifiers such as C4.5 can classify incomplete datasets directly, but they often produce more complex models with larger classification errors. The purpose of this paper is to propose a wrapper-based feature selection method that improves the performance of classifiers able to classify incomplete datasets. To this end, the feature selection method evaluates feature subsets using a classifier that can classify incomplete data directly. Empirical results on 14 datasets, using particle swarm optimisation to search for feature subsets and C4.5 to evaluate them, show that the wrapper-based feature selection not only improves the classification accuracy of the classifier but also reduces the size of the trees it generates. © 2016, Springer-Verlag Berlin Heidelberg.
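The abstract describes a wrapper setup: a binary PSO searches over feature subsets, and each candidate subset is scored by the accuracy of a classifier that can handle missing values directly. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a recent scikit-learn; because scikit-learn has no C4.5, HistGradientBoostingClassifier (which accepts NaN inputs natively) stands in for the evaluating classifier, and the swarm size, iteration count, and constriction-style coefficients (w = 0.7298, c1 = c2 = 1.49618, a common PSO convention) are illustrative choices.

# Minimal sketch (not the authors' code) of wrapper-based feature selection on
# incomplete data: a binary PSO searches feature-subset masks, and each mask is
# scored by cross-validated accuracy of a classifier that accepts missing values.
# Assumption: HistGradientBoostingClassifier stands in for C4.5 because it
# handles NaN inputs natively; all parameters below are illustrative.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def subset_accuracy(X, y, mask, cv=5):
    # Wrapper evaluation: train and test only on the selected features, NaNs included.
    if not mask.any():
        return 0.0                      # an empty subset gets the worst possible score
    clf = HistGradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=cv).mean()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpso_feature_selection(X, y, n_particles=20, n_iter=30,
                           w=0.7298, c1=1.49618, c2=1.49618):
    n_features = X.shape[1]
    pos = rng.uniform(-1.0, 1.0, (n_particles, n_features))   # continuous positions
    vel = np.zeros_like(pos)
    masks = rng.random(pos.shape) < sigmoid(pos)               # stochastic bit masks
    pbest_pos = pos.copy()
    pbest_fit = np.array([subset_accuracy(X, y, m) for m in masks])
    best = pbest_fit.argmax()
    gbest_pos, gbest_fit, gbest_mask = pos[best].copy(), pbest_fit[best], masks[best].copy()

    for _ in range(n_iter):
        # Standard velocity/position update pulled toward personal and global bests.
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = pos + vel
        masks = rng.random(pos.shape) < sigmoid(pos)
        fits = np.array([subset_accuracy(X, y, m) for m in masks])
        improved = fits > pbest_fit
        pbest_pos[improved], pbest_fit[improved] = pos[improved], fits[improved]
        if fits.max() > gbest_fit:
            best = fits.argmax()
            gbest_pos, gbest_fit, gbest_mask = pos[best].copy(), fits[best], masks[best].copy()

    return gbest_mask, gbest_fit

# Hypothetical usage on an incomplete dataset (X is a NumPy array containing NaN entries):
#   mask, acc = bpso_feature_selection(X, y)
#   print(mask.sum(), "features selected; cross-validated accuracy:", acc)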
Pages: 81-94
Page count: 13