Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

Cited by: 47
Authors
Song, Qinbao [1]
Shepperd, Martin [2]
Chen, Xiangru [1]
Liu, Jun [3]
Affiliations
[1] Xi An Jiao Tong Univ, Dept Comp Sci & Technol, Xian 710049, Shaanxi, Peoples R China
[2] Brunel Univ, Sch IS Comp & Maths, Uxbridge UB8 3PH, Middx, England
[3] Shaanxi Elect Power Training Ctr Staff Members, Xian 710038, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Missing data; Missing data toleration; C4.5; Data imputation; Software project cost prediction;
DOI
10.1016/j.jss.2008.05.008
Chinese Library Classification (CLC) number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to impute the missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. We also found that both C4.5 and k-NN are little affected by the missingness mechanism, but that the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%. © 2008 Elsevier Inc. All rights reserved.
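As a rough illustration of the imputation step mentioned in the abstract, the following is a minimal sketch of k-NN imputation on numeric data. It is not the authors' implementation: the function name `knn_impute`, the use of `None` as the missing-value marker, the root-mean-square distance over jointly observed features, and the mean of the neighbours' values as the fill-in are all assumptions made for this example.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest donors' values.

    Hypothetical helper for illustration only; the record does not
    specify the distance measure or aggregation the paper used.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                # Compare only features observed in both rows.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if candidates:
                candidates.sort(key=lambda c: c[0])
                nearest = candidates[:k]
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data (invented for this example): the third row lacks its second value.
rows = [[1.0, 10.0], [2.0, 20.0], [1.1, None], [9.0, 90.0]]
filled = knn_impute(rows, k=2)
print(filled[2][1])  # 15.0
```

In a comparison like the one the abstract describes, the imputed table would then be passed to the decision-tree learner, while the alternative is to hand the learner the original table and rely on its built-in handling of missing values.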
Pages: 2361-2370
Page count: 10