Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data

Cited by: 0
Authors
Ayuyev, Vadim V. [2]
Jupin, Joseph [1]
Harris, Philip W. [3]
Obradovic, Zoran [1]
Affiliations
[1] Temple Univ, Ctr Informat Sci & Technol, 303 Wachman Hall,1805 N Broad St, Philadelphia, PA 19122 USA
[2] Bauman Moscow State Tech Univ, FNI KF Dept, Kaluga Branch, Kaluga 248600, Russia
[3] Temple Univ, Dept Criminal Justice, Philadelphia, PA 19122 USA
Source
DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS | 2009 / Vol. 5691
Keywords
data pre-processing; data imputation; clustering; classification; MULTIPLE IMPUTATION;
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The appropriate choice of a method for imputation of missing data becomes especially important when the fraction of missing values is large and the data are of mixed type. The proposed Dynamic Clustering Imputation (DCI) algorithm relies on similarity information from shared neighbors, where mixed type variables are considered together. When evaluated on a public social science dataset of 46,043 mixed type instances with up to 33% missing values, DCI resulted in more than 20% improved imputation accuracy over Multiple Imputation, Predictive Mean Matching, Linear and Multilevel Regression, and Mean Mode Replacement methods. Data imputed by the six methods were used for prediction tests by NB-Tree, Random Subset Selection and Neural Network-based classification models. In our experiments, classification accuracy obtained using DCI-preprocessed data was much better than when relying on alternative imputation methods for data preprocessing.
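The abstract does not specify how DCI computes or uses shared-neighbor similarity, so the following is only a rough sketch of the general idea and not the authors' algorithm. It assumes a Gower-style distance over mixed attributes, a fixed neighborhood size `k`, and a fill rule that takes the mean (numeric) or mode (categorical) of the donors sharing the most neighbors with the incomplete record. All names (`mixed_distance`, `shared_neighbor_impute`, `MISSING`) are hypothetical.

```python
from collections import Counter
from statistics import mean

MISSING = None  # assumed marker for a missing value


def mixed_distance(a, b, numeric_cols, ranges):
    """Gower-style distance over the attributes observed in both records."""
    total, used = 0.0, 0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is MISSING or y is MISSING:
            continue
        if j in numeric_cols:
            # Range-normalised absolute difference for numeric attributes.
            total += abs(x - y) / ranges[j] if ranges[j] else 0.0
        else:
            # Simple mismatch indicator for categorical attributes.
            total += 0.0 if x == y else 1.0
        used += 1
    # Records with no attribute in common are treated as maximally distant.
    return total / used if used else 1.0


def shared_neighbor_impute(rows, numeric_cols, k=2):
    """Fill each missing cell from the donors sharing the most k-nearest neighbors."""
    n = len(rows)
    ranges = {}
    for j in numeric_cols:
        vals = [r[j] for r in rows if r[j] is not MISSING]
        ranges[j] = (max(vals) - min(vals)) if vals else 0.0

    # k-nearest-neighbor set of every record under the mixed distance.
    neighbors = []
    for i in range(n):
        order = sorted(
            (o for o in range(n) if o != i),
            key=lambda o: mixed_distance(rows[i], rows[o], numeric_cols, ranges),
        )
        neighbors.append(set(order[:k]))

    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            donors = [o for o in range(n) if o != i and rows[o][j] is not MISSING]
            if not donors:
                continue  # nothing observed in this column; leave the gap
            shared = {o: len(neighbors[i] & neighbors[o]) for o in donors}
            best = max(shared.values())
            top = [rows[o][j] for o in donors if shared[o] == best]
            if j in numeric_cols:
                result[i][j] = mean(top)
            else:
                result[i][j] = Counter(top).most_common(1)[0][0]
    return result


if __name__ == "__main__":
    data = [
        (1.0, "a"), (1.2, "a"), (1.1, MISSING),
        (9.0, "b"), (9.3, "b"), (MISSING, "b"),
    ]
    for filled in shared_neighbor_impute(data, numeric_cols={0}, k=2):
        print(filled)
```

Counting shared neighbors rather than using raw distance alone is what lets records with few attributes in common still be compared through the company they keep. The neighborhood size and the tie handling here are arbitrary illustrative choices.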
Pages: 366 / +
Page count: 2