A new imputation method for small software project data sets

Cited by: 45
Authors
Song, Qinbao [1]
Shepperd, Martin [2]
Affiliations
[1] Xi An Jiao Tong Univ, Xian 710049, Shaanxi, Peoples R China
[2] Brunel Univ, Uxbridge UB8 3PH, Middx, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
software effort prediction; missing data; data imputation; class mean imputation; k-NN imputation
DOI
10.1016/j.jss.2006.05.003
Chinese Library Classification (CLC) number
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
Effort prediction is a very important issue for software project management. Historical project data sets are frequently used to support such prediction. But missing data are often contained in these data sets and this makes prediction more difficult. One common practice is to ignore the cases with missing data, but this makes the originally small software project database even smaller and can further decrease the accuracy of prediction. The alternative is missing data imputation. There are many imputation methods. Software data sets are frequently characterised by their small size but unfortunately sophisticated imputation methods prefer larger data sets. For this reason we explore using simple methods to impute missing data in small project effort data sets. We propose a class mean imputation (CMI) method based on the k-NN hot deck imputation method (MINI) to impute both continuous and nominal missing data in small data sets. We use an incremental approach to increase the variance of population. To evaluate MINI (and k-NN and CMI methods as benchmarks) we use data sets with 50 cases and 100 cases sampled from a larger industrial data set with 10%, 15%, 20% and 30% missing data percentages respectively. We also simulate Missing Completely at Random (MCAR) and Missing at Random (MAR) missingness mechanisms. The results suggest that the MINI method outperforms both CMI and the k-NN methods. We conclude that this new imputation technique can be used to impute missing values in small data sets. (C) 2006 Elsevier Inc. All rights reserved.
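The two benchmark ideas named in the abstract, class mean imputation and k-NN imputation, together with MCAR simulation, can be illustrated with a minimal sketch. This is not the paper's MINI method or its experimental code. The function names, the toy `projects` data, the fallback to the overall mean, the use of unnormalised Euclidean distance, and the choice to average the k nearest donors are all assumptions made for illustration, and only continuous values are handled here.

```python
import math
import random
from statistics import mean

# Hypothetical toy data: "type" is the class, "effort" has missing values.
projects = [
    {"type": "A", "size": 10.0, "effort": 100.0},
    {"type": "A", "size": 12.0, "effort": 120.0},
    {"type": "A", "size": 11.0, "effort": None},
    {"type": "B", "size": 50.0, "effort": 500.0},
    {"type": "B", "size": 55.0, "effort": 560.0},
    {"type": "B", "size": 52.0, "effort": None},
]


def inject_mcar(rows, column, fraction, seed=0):
    """Copy rows and blank out `fraction` of `column`, chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(len(rows) * fraction)
    missing_idx = set(rng.sample(range(len(rows)), n_missing))
    return [
        {**row, column: None} if i in missing_idx else dict(row)
        for i, row in enumerate(rows)
    ]


def class_mean_impute(rows, column, class_key):
    """Fill missing `column` values with the mean of observed values in the same class."""
    observed = {}
    for row in rows:
        if row[column] is not None:
            observed.setdefault(row[class_key], []).append(row[column])
    # Assumption: fall back to the overall mean when a class has no observed value.
    overall = mean(v for values in observed.values() for v in values)
    filled = []
    for row in rows:
        if row[column] is None:
            values = observed.get(row[class_key])
            row = {**row, column: mean(values) if values else overall}
        filled.append(dict(row))
    return filled


def knn_impute(rows, column, features, k=2):
    """Fill missing `column` values with the mean over the k nearest complete cases."""
    donors = [r for r in rows if r[column] is not None]

    def distance(a, b):
        # Assumption: plain Euclidean distance on unnormalised, fully observed features.
        return math.dist([a[f] for f in features], [b[f] for f in features])

    filled = []
    for row in rows:
        if row[column] is None:
            nearest = sorted(donors, key=lambda d: distance(row, d))[:k]
            row = {**row, column: mean(d[column] for d in nearest)}
        filled.append(dict(row))
    return filled
```

Calling `class_mean_impute(projects, "effort", "type")` or `knn_impute(projects, "effort", ["size"])` returns a new list with the gaps filled, and `inject_mcar` can be used to blank out a chosen fraction of a complete column before comparing the two. In a real study the features would usually be normalised first, and nominal attributes would need a different distance and a mode-based fill.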
Pages: 51-62
Page count: 12