Improving performance of classification on incomplete data using feature selection and clustering

Cited by: 31

Authors:

Cao Truong Tran [1,2]

Zhang, Mengjie [1]

Andreae, Peter [1]

Xue, Bing [1]

Lam Thu Bui [2]

Affiliations:

[1] Victoria Univ Wellington, Sch Engn & Comp Sci, POB 600, Wellington 6140, New Zealand

[2] Le Quy Don Tech Univ, Res Grp Computat Intelligence, 236 Hoang Quoc Viet St, Hanoi, Vietnam

Source:

APPLIED SOFT COMPUTING | 2018 / Vol. 73

Keywords:

Incomplete data; Classification; Imputation; Clustering; Feature selection; Differential evolution; MISSING DATA IMPUTATION; MULTIPLE IMPUTATION; DIFFERENTIAL EVOLUTION; ALGORITHM; VALUES; IMPACT;

DOI:

10.1016/j.asoc.2018.09.026

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Missing values are an unavoidable issue in many real-world datasets. One of the most popular approaches to classification with incomplete data is to use imputation to replace missing values with plausible values. However, powerful imputation methods are too computationally intensive when applying a classifier to a new unknown instance. This paper proposes new approaches to integrating imputation, clustering and feature selection for classification with incomplete data in order to improve efficiency without loss of accuracy. Clustering is used to reduce the number of instances used by the imputation. Feature selection is used to remove redundant and irrelevant features of the training data, which greatly reduces the cost of imputation. The paper also investigates the ability of Differential Evolution (DE) to search for feature subsets with incomplete data. Results show that the integration of imputation, clustering and feature selection not only improves classification accuracy, but also dramatically reduces the computation time required to estimate missing values when classifying new instances. © 2018 Elsevier B.V. All rights reserved.
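The general idea in the abstract (impute a new instance against a few cluster representatives instead of the full training set, and only on a reduced feature subset) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the training matrix is already complete, uses a plain k-means with a deterministic farthest-point initialisation, and takes the feature mask `selected` as a hard-coded, hypothetical input, whereas the paper searches for feature subsets with DE. The names `kmeans` and `impute_with_centroids` are illustrative only.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Plain Lloyd's k-means on a complete matrix X; returns the k centroids."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def impute_with_centroids(x, centroids):
    """Fill NaNs in x from the centroid that is closest on x's observed features."""
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    if not observed.any():
        return centroids.mean(axis=0)  # nothing observed: fall back to the mean centroid
    d = ((centroids[:, observed] - x[observed]) ** 2).sum(axis=1)
    nearest = centroids[d.argmin()]
    return np.where(observed, x, nearest)

# Toy training data: two well-separated groups with four features.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                     rng.normal(5.0, 0.1, (20, 4))])

# Hypothetical feature mask; in the paper this would come from feature selection.
selected = np.array([True, True, False, True])

# Cluster once at training time, on the selected features only.
centroids = kmeans(X_train[:, selected], k=2)

# A new instance with a missing value is compared with 2 centroids, not 40 rows.
x_new = np.array([5.1, np.nan, 0.3, 4.9])
x_filled = impute_with_centroids(x_new[selected], centroids)
```

In this sketch the cost of imputing one new instance scales with the number of centroids and the number of selected features rather than with the size of the training set, which is the efficiency argument the abstract makes.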

Citation

Pages: 848-861

Page count: 14

