Mixture of Gaussians for distance estimation with missing data

Cited by: 49
Authors
Eirola, Emil [1 ]
Lendasse, Amaury [1 ,2 ,3 ,4 ]
Vandewalle, Vincent [5 ,6 ]
Biernacki, Christophe [4 ,6 ]
Affiliations
[1] Aalto Univ, Dept Informat & Comp Sci, FI-00076 Aalto, Finland
[2] Basque Fdn Sci, IKERBASQUE, Bilbao 48011, Spain
[3] Univ Basque Country, Fac Comp Sci, Computat Intelligence Grp, Donostia San Sebastian, Spain
[4] Univ Lille 1, CNRS, Lab P Painleve, UMR 8524, F-59655 Villeneuve Dascq, France
[5] Univ Lille 2, EA 2694, F-59045 Lille, France
[6] INRIA Lille Nord Europe, MODAL Team, F-59650 Villeneuve Dascq, France
Keywords
Missing data; Distance estimation; Mixture model; Selection; Likelihood
DOI
10.1016/j.neucom.2013.07.050
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many data sets have missing values in practical application contexts, but most commonly studied machine learning methods cannot be applied directly to incomplete samples. However, most such methods depend only on the relative differences between samples rather than their particular values, so one useful approach is to directly estimate the pairwise distances between all samples in the data set. This is accomplished by fitting a Gaussian mixture model to the data and using it to derive estimates for the distances. A variant of the model for high-dimensional data with missing values is also studied. Experimental simulations confirm that the proposed method provides accurate estimates compared to alternative methods for estimating distances. In particular, using the mixture model to estimate distances is on average more accurate than using the same model to impute any missing values and then calculating distances. The experimental evaluation additionally shows that more accurate distance estimates lead to improved prediction performance for classification and regression tasks when used as inputs for a neural network. (C) 2013 Elsevier B.V. All rights reserved.
Pages: 32-42
Page count: 11
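The abstract describes deriving expected pairwise distances from a Gaussian mixture fitted to data with missing values, rather than imputing first and computing distances afterwards. As a rough illustration of the underlying idea only (not the authors' implementation), the sketch below simplifies the mixture to a single Gaussian fitted to complete rows and computes the expected squared distance between two partially observed samples from the conditional mean and conditional covariance of their missing entries, using E||x_i - x_j||^2 = ||m_i - m_j||^2 + tr(C_i) + tr(C_j) for independent samples. All function and variable names are illustrative.

```python
import numpy as np

def conditional_fill(x, mean, cov):
    """Replace the NaN entries of x by their conditional expectation under a
    multivariate Gaussian N(mean, cov), and return the conditional covariance
    of those entries given the observed ones."""
    miss = np.isnan(x)
    obs = ~miss
    filled = x.astype(float).copy()
    if not miss.any():
        return filled, np.zeros((0, 0))
    if not obs.any():
        return mean.copy(), cov.copy()
    S_oo = cov[np.ix_(obs, obs)]
    S_mo = cov[np.ix_(miss, obs)]
    S_mm = cov[np.ix_(miss, miss)]
    # Conditional mean and covariance of the missing block given the observed block.
    filled[miss] = mean[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mean[obs])
    cond_cov = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    return filled, cond_cov

def expected_sq_distance(xi, xj, mean, cov):
    """E[||xi - xj||^2] when the missing parts of xi and xj are modelled as
    independent Gaussians: ||m_i - m_j||^2 + tr(C_i) + tr(C_j)."""
    mi, Ci = conditional_fill(xi, mean, cov)
    mj, Cj = conditional_fill(xj, mean, cov)
    return float(np.sum((mi - mj) ** 2) + np.trace(Ci) + np.trace(Cj))

# Illustrative usage on synthetic data: fit a single Gaussian to complete rows,
# then estimate the distance between two partially observed samples.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 1.0, -1.0],
                            [[1.0, 0.5, 0.2],
                             [0.5, 1.0, 0.3],
                             [0.2, 0.3, 1.0]], size=500)
mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
xi = np.array([0.3, np.nan, -0.8])
xj = np.array([np.nan, 1.2, -1.1])
print(expected_sq_distance(xi, xj, mean, cov))
```

The paper itself fits a full Gaussian mixture (via EM adapted to incomplete samples) and sums the analogous conditional moments over components; the single-Gaussian version above is only meant to show why estimating the expected distance directly differs from imputing the conditional means and then measuring the distance (the trace terms would otherwise be dropped).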