Data Mining Techniques for Software Effort Estimation: A Comparative Study

Cited by: 120
Authors
Dejaeger, Karel [1]
Verbeke, Wouter [1]
Martens, David [2]
Baesens, Bart [1,3]
Affiliations
[1] Katholieke Univ Leuven, Dept Decis Sci & Informat Management, B-3000 Louvain, Belgium
[2] Univ Antwerp, Fac Appl Econ, B-2000 Antwerp, Belgium
[3] Univ Southampton, Sch Management, Highfield Southampton SO17 1BJ, Hants, England
Keywords
Data mining; software effort estimation; regression; cost estimation; feedforward networks; empirical validation; mutual information; effort prediction; feature selection; neural networks; models; classification; analogy
DOI
10.1109/TSE.2011.55
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
A predictive model is required to be accurate and comprehensible in order to inspire confidence in a business setting. Both aspects have been assessed in a software effort estimation setting by previous studies. However, no univocal conclusion as to which technique is best suited has been reached. This study addresses this issue by reporting on the results of a large-scale benchmarking study. Different types of techniques are under consideration, including techniques inducing tree/rule-based models like M5 and CART, linear models such as various types of linear regression, nonlinear models (MARS, multilayered perceptron neural networks, radial basis function networks, and least squares support vector machines), and estimation techniques that do not explicitly induce a model (e.g., a case-based reasoning approach). Furthermore, the aspect of feature subset selection by using a generic backward input selection wrapper is investigated. The results are subjected to rigorous statistical testing and indicate that ordinary least squares regression in combination with a logarithmic transformation performs best. Another key finding is that by selecting a subset of highly predictive attributes, such as project size, development-related, and environment-related attributes, a significant increase in estimation accuracy can typically be obtained.
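As a rough illustration of what "ordinary least squares regression in combination with a logarithmic transformation" can look like, the sketch below fits log(effort) against log(size) with NumPy. It is not the authors' implementation: the single-predictor setup, the function names `fit_log_ols` and `predict_effort`, and the data values are all assumptions made up for this example.

```python
import numpy as np

def fit_log_ols(size, effort):
    """Fit log(effort) = b0 + b1 * log(size) by ordinary least squares."""
    size = np.asarray(size, dtype=float)
    effort = np.asarray(effort, dtype=float)
    # Design matrix: intercept column plus log-transformed size.
    X = np.column_stack([np.ones(len(size)), np.log(size)])
    y = np.log(effort)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # array [b0, b1]

def predict_effort(coef, size):
    """Map a log-scale prediction back to the original effort scale."""
    size = np.asarray(size, dtype=float)
    return np.exp(coef[0] + coef[1] * np.log(size))

# Hypothetical, noise-free data (NOT from the study): effort = 3 * size^1.1
size = np.array([10.0, 25.0, 50.0, 100.0, 200.0])
effort = 3.0 * size ** 1.1

coef = fit_log_ols(size, effort)
estimate = predict_effort(coef, [40.0])
```

Because the model is linear on the log scale, the fitted slope acts as an exponent on size and the exponentiated intercept acts as a multiplicative constant on the original scale.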
Pages: 375-397
Number of pages: 23