A short note on safest default missingness mechanism assumptions

Cited by: 19
Authors
Song, QB [1 ]
Shepperd, M [1 ]
Cartwright, K [1 ]
Affiliation
[1] Sch Design Engn & Comp, Empir Software Engn Res Grp, Bournemouth, Dorset, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
software effort prediction; missing data; data imputation; missingness mechanism;
DOI
10.1007/s10664-004-6193-8
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
A very common problem when building software engineering models is dealing with missing data. To address this, there exists a range of imputation techniques. However, selecting the appropriate imputation technique can itself be a difficult problem. One reason is that these techniques make assumptions about the underlying missingness mechanism, that is, how the missing values are distributed within the data set. This is compounded by the fact that, for small data sets, it may be very difficult to determine what the missingness mechanism is. This means there is a danger of using an inappropriate imputation technique. Therefore, it is necessary to determine the safest default assumption about the missingness mechanism for imputation techniques when dealing with small data sets. We experimentally examine two simple and commonly used techniques, Class Mean Imputation (CMI) and k Nearest Neighbors (k-NN), coupled with two missingness mechanisms: missing completely at random (MCAR) and missing at random (MAR). We draw two conclusions. First, for our analysis, CMI is the preferred technique since it is more accurate. Second, and more importantly, the impact of the missingness mechanism on imputation accuracy is not statistically significant. This is a useful finding since it suggests that, even for small data sets, we can reasonably make the weaker assumption that the missingness mechanism is MAR. Thus both imputation techniques have practical application for small software engineering data sets with missing values.
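As an illustration of the terms used in the abstract, the following is a minimal Python sketch, not the authors' experimental code. It assumes the usual textbook readings: under MCAR every value is equally likely to be missing, under MAR the chance of being missing depends on another, fully observed variable, and class mean imputation replaces a missing value with the mean of the observed values in the same class. The function names, the median split, and the 1.5x / 0.5x weighting are hypothetical choices made only for this example.

```python
import random
from statistics import mean, median


def mcar_mask(n, rate, rng):
    """MCAR: each of the n values is missing with the same probability `rate`."""
    return [rng.random() < rate for _ in range(n)]


def mar_mask(driver, rate, rng):
    """MAR: missingness depends on a fully observed variable `driver`.

    Illustrative assumption: rows whose driver value is above the median
    are missing with probability 1.5 * rate, the others with 0.5 * rate.
    """
    cutoff = median(driver)
    mask = []
    for d in driver:
        p = min(1.0, (1.5 if d > cutoff else 0.5) * rate)
        mask.append(rng.random() < p)
    return mask


def class_mean_impute(values, classes):
    """Replace each None with the mean of the observed values in its class.

    Falls back to the overall observed mean when a class has no observed
    values; assumes at least one value is observed overall.
    """
    by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    overall = mean(v for v in values if v is not None)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [
        class_means.get(c, overall) if v is None else v
        for v, c in zip(values, classes)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    effort = [10.0, 12.0, 30.0, 34.0, 50.0, 55.0]   # made-up values
    size = [1.0, 1.2, 3.0, 3.3, 5.0, 5.4]           # fully observed driver
    classes = ["small", "small", "mid", "mid", "large", "large"]

    mask = mar_mask(size, 0.4, rng)
    with_gaps = [None if m else e for e, m in zip(effort, mask)]
    print(class_mean_impute(with_gaps, classes))
```

The k-NN alternative named in the abstract would instead fill a gap from the most similar complete cases; it is omitted here to keep the sketch short.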
Pages: 235-243
Number of pages: 9