Evaluating logistic regression models to estimate software project outcomes

Cited by: 32
Authors
Cerpa, Narciso [1]
Bardeen, Matthew [1]
Kitchenham, Barbara [2]
Verner, June [3]
Affiliations
[1] Univ Talca, Fac Ingn, Curico, Chile
[2] Univ Keele, Sch Comp & Math, Keele ST5 5BG, Staffs, England
[3] Univ New S Wales, Sydney, NSW, Australia
Keywords
Project outcome; Tailored cut-off; ROC analysis; Single-company data; Cross-company data; Classifier evaluation; SUCCESS; PRACTITIONERS; PERCEPTIONS; PREDICTION; ACCEPTANCE; THINK; EASE;
DOI
10.1016/j.infsof.2010.03.011
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Context: Software has been developed since the 1960s but the success rate of software development projects is still low. During the development of software, the probability of success is affected by various practices or aspects. To date, it is not clear which of these aspects are more important in influencing project outcome. Objective: In this research, we identify aspects which could influence project success, build prediction models based on the aspects using data collected from multiple companies, and then test their performance on data from a single organization. Method: A survey-based empirical investigation was used to examine variables and factors that contribute to project outcome. Variables that were highly correlated to project success were selected and the set of variables was reduced to three factors by using principal components analysis. A logistic regression model was built for both the set of variables and the set of factors, using heterogeneous data collected from two different countries and a variety of organizations. We tested these models by using a homogeneous hold-out dataset from one organization. We used the receiver operating characteristic (ROC) analysis to compare the performance of the variable and factor-based models when applied to the homogeneous dataset. Results: We found that using raw variables or factors in the logistic regression models did not make any significant difference in predictive capability. The prediction accuracy of these models is more balanced when the cut-off is set to the ratio of success to failures in the datasets used to build the models. We found that the raw variable and factor-based models predict significantly better than random chance. Conclusion: We conclude that an organization wishing to estimate whether a project will succeed or fail may use a model created from heterogeneous data derived from multiple organizations. (C) 2010 Elsevier B.V. All rights reserved.
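The tailored cut-off and ROC comparison mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the data below are invented, the helper names are hypothetical, and it assumes that "the ratio of success to failures" means the proportion of successful projects in the model-building data.

```python
def confusion_rates(probs, labels, cutoff):
    """Return (sensitivity, specificity), predicting success when prob >= cutoff."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if y == 1 and p >= cutoff)
    fn = sum(1 for p, y in pairs if y == 1 and p < cutoff)
    tn = sum(1 for p, y in pairs if y == 0 and p < cutoff)
    fp = sum(1 for p, y in pairs if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def auc(probs, labels):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Invented data (assumption): 1 = successful project, 0 = failed project.
train_labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]       # model-building set
holdout_probs = [0.95, 0.85, 0.80, 0.75, 0.65, 0.60, 0.55, 0.40]
holdout_labels = [1, 1, 1, 1, 0, 1, 0, 0]            # single-organization set

# Assumed reading of the tailored cut-off: share of successes in training data.
tailored_cutoff = sum(train_labels) / len(train_labels)

default_rates = confusion_rates(holdout_probs, holdout_labels, 0.5)
tailored_rates = confusion_rates(holdout_probs, holdout_labels, tailored_cutoff)
holdout_auc = auc(holdout_probs, holdout_labels)

print("cut-off 0.5:", default_rates)
print("tailored cut-off:", tailored_cutoff, tailored_rates)
print("AUC:", holdout_auc)
```

In this invented example the default 0.5 cut-off classifies every successful project correctly but only one of three failures, while the tailored cut-off trades some sensitivity for higher specificity. An AUC above 0.5 corresponds to ranking projects better than random chance.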
Pages: 934-944
Number of pages: 11