An embedded imputation method via Attribute-based Decision Graphs

Cited by: 11
Authors
Bertini Junior, Joao Roberto [1]
Nicoletti, Maria do Carmo [1, 2]
Zhao, Liang [3]
Affiliations
[1] Univ Fed Sao Carlos, Dept Comp Sci, Rod Washington Luis Km 235, BR-13565905 Sao Carlos, SP, Brazil
[2] FACCAMP, R Guatemala 167, BR-13231230 Campo Limpo Paulista, SP, Brazil
[3] Univ Sao Paulo, Sch Philosophy Sci & Literature Ribeirao Preto, Dept Comp Sci & Math, Ave Bandeirantes 3900, BR-14040901 Ribeirao Preto, SP, Brazil
Funding
São Paulo Research Foundation (Brazil)
Keywords
Missing attribute value; Data imputation; Single imputation; Attribute-based Decision Graphs; Machine learning based imputation; Methods; MISSING VALUE IMPUTATION; MULTIPLE IMPUTATION; VALUES; PREDICTION; REGRESSION; DISCRETE; ERROR;
DOI
10.1016/j.eswa.2016.03.027
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The performance of classification algorithms is highly dependent on the quality of the training data. Missing attribute values are quite common in many real-world applications; in such cases, a complementary method is needed to improve the quality of the data and, consequently, the performance of the classifier. Two strategies are commonly employed in practice to deal with this problem: 1) multiple imputation, which often maintains the statistical properties of the original data and usually performs well, at the expense of high computational cost; and 2) single imputation, which in general provides a suitable solution for data sets with few missing attribute values but rarely achieves good results when the number of missing values is high. This paper proposes a new single imputation method that uses Attribute-based Decision Graphs (AbDG) to estimate the missing values. AbDGs are a new type of data graph that embeds the information contained in the training set into a graph structure built over pre-defined intervals of values from different attributes. As a consequence, similar data instances induce similar subgraphs when projected onto the AbDG, resulting in distinct patterns of connections. The main contribution of the paper is a well-defined procedure that performs imputation by partially matching instances with missing values against the AbDG. The proposed imputation method can effectively deal with data sets that have high rates of missing attribute values while incurring low computational cost, a significant result towards the development of robust expert and intelligent systems. The obtained results provide evidence that the proposed method is sound and promotes good-quality imputation for classification purposes. (C) 2016 Elsevier Ltd. All rights reserved.
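The idea of a graph built over attribute-value intervals, with missing values estimated by partial matching, can be illustrated with a deliberately simplified sketch. This is not the authors' AbDG algorithm: the abstract gives no construction details, so the equal-width binning, the co-occurrence edge weights, the midpoint estimate, and all function names (`interval_index`, `build_graph`, `impute`) are assumptions made only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def interval_index(value, low, high, bins):
    """Map a value to one of `bins` equal-width intervals on [low, high]."""
    if high == low:
        return 0
    idx = int((value - low) / (high - low) * bins)
    return min(max(idx, 0), bins - 1)


def build_graph(rows, bins=3):
    """Build a toy interval graph (assumed design, not the paper's).

    Vertices are (attribute, interval) pairs; an edge weight counts how
    many training rows fall into both intervals at once.
    """
    n_attrs = len(rows[0])
    ranges = []
    for a in range(n_attrs):
        seen = [r[a] for r in rows if r[a] is not None]
        ranges.append((min(seen), max(seen)))
    edges = defaultdict(int)
    for r in rows:
        nodes = [(a, interval_index(v, *ranges[a], bins))
                 for a, v in enumerate(r) if v is not None]
        for u, w in combinations(nodes, 2):
            edges[(u, w)] += 1
            edges[(w, u)] += 1
    return edges, ranges


def impute(row, edges, ranges, bins=3):
    """Replace each None with the midpoint of the best-connected interval.

    The observed attributes are matched against the graph; for a missing
    attribute, the interval with the largest total edge weight to the
    observed vertices wins (ties go to the lowest interval).
    """
    observed = [(a, interval_index(v, *ranges[a], bins))
                for a, v in enumerate(row) if v is not None]
    filled = list(row)
    for a, v in enumerate(row):
        if v is not None:
            continue
        low, high = ranges[a]
        scores = [sum(edges.get(((a, i), node), 0) for node in observed)
                  for i in range(bins)]
        best = scores.index(max(scores))
        width = (high - low) / bins
        filled[a] = low + (best + 0.5) * width
    return filled
```

In this sketch an instance with a missing attribute is matched only through its observed attributes, which mirrors the partial-matching idea at a very coarse level; the published method should be consulted for the actual graph construction and estimation rules.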
Pages: 159-177
Number of pages: 19