Accurate Tree-based Missing Data Imputation and Data Fusion within the Statistical Learning Paradigm

Cited by: 22
Authors
D'Ambrosio, Antonio [1, 2]
Aria, Massimo [1, 2]
Siciliano, Roberta [1, 2]
Affiliations
[1] Univ Naples Federico II, Dept Math & Stat, I-80126 Naples, Italy
[2] Univ Naples Federico II, I-80138 Naples, Italy
Keywords
Data editing; Tree-based methods; Boosting algorithm; FAST algorithm; Incremental imputation; Generalization error; Incomplete data
DOI
10.1007/s00357-012-9108-1
Chinese Library Classification number
O1 [Mathematics]
Subject classification code
0701; 070101
Abstract
This paper is set within the framework of statistical data editing: specifically, how to edit or impute missing or contradictory data, and how to merge two independent data sets that each present some lack of information. Assuming a missing-at-random mechanism, the paper provides an accurate tree-based methodology for both missing data imputation and data fusion, justified within Vapnik's Statistical Learning Theory. It considers both an incremental variable imputation method, to improve computational efficiency, and boosted trees, to gain prediction accuracy over other methods. As a result, the best approximation of the structural risk (also known as irreducible error) is reached, thus minimizing the generalization (or prediction) error of imputation. Moreover, the methodology is distribution-free: it holds independently of the underlying probability law generating the missing data values. Performance is analyzed through simulated case studies and real-world applications.
Pages: 227-258
Page count: 32