An Empirical Comparison of Model Validation Techniques for Defect Prediction Models

Cited by: 376
Authors
Tantithamthavorn, Chakkrit [1 ]
McIntosh, Shane [2 ]
Hassan, Ahmed E. [3 ]
Matsumoto, Kenichi [1 ]
Affiliations
[1] Nara Inst Sci & Technol, Grad Sch Informat Sci, Ikoma 6300192, Japan
[2] McGill Univ, Dept Elect & Comp Engn, Montreal, PQ H3A 0G4, Canada
[3] Queens Univ, Sch Comp, Kingston, ON K7L 3N6, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
Defect prediction models; model validation techniques; bootstrap validation; cross validation; holdout validation; CROSS-VALIDATION; ERROR RATE; PROGNOSTIC MODELS; SOFTWARE; REGRESSION; PERFORMANCE; CODE; BOOTSTRAP; FRAMEWORK; SELECTION;
DOI
10.1109/TSE.2016.2584050
Chinese Library Classification
TP31 [Computer Software];
Discipline codes
081202; 0835;
Abstract
Defect prediction models help software quality assurance teams to allocate their limited resources to the most defect-prone modules. Model validation techniques, such as k-fold cross-validation, use historical data to estimate how well a model will perform in the future. However, little is known about how accurate the estimates of model validation techniques tend to be. In this paper, we investigate the bias and variance of model validation techniques in the domain of defect prediction. Analysis of 101 public defect datasets suggests that 77 percent of them are highly susceptible to producing unstable results; selecting an appropriate model validation technique is therefore a critical experimental design choice. Based on an analysis of 256 studies in the defect prediction literature, we select the 12 most commonly adopted model validation techniques for evaluation. Through a case study of 18 systems, we find that single-repetition holdout validation tends to produce estimates with 46-229 percent more bias and 53-863 percent more variance than the top-ranked model validation techniques. On the other hand, out-of-sample bootstrap validation yields the best balance between the bias and variance of estimates in the context of our study. Therefore, we recommend that future defect prediction studies avoid single-repetition holdout validation, and instead use out-of-sample bootstrap validation.
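The out-of-sample bootstrap the abstract recommends can be summarized as: draw a training set of size n with replacement, train on it, evaluate on the rows the draw missed (about 36.8 percent on average), and average over many repetitions. A minimal sketch of that procedure, using a toy majority-class model and dataset as stand-ins (the function names and data here are illustrative, not the paper's actual experimental setup):

```python
import random

def out_of_sample_bootstrap(data, labels, train_fn, score_fn, reps=100, seed=42):
    """Estimate model performance via out-of-sample bootstrap validation."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(reps):
        # Draw n indices with replacement; duplicates are expected.
        boot = [rng.randrange(n) for _ in range(n)]
        # Evaluate only on rows that the bootstrap sample did not select.
        held_out = sorted(set(range(n)) - set(boot))
        if not held_out:
            continue  # vanishingly rare for realistic n
        model = train_fn([data[i] for i in boot], [labels[i] for i in boot])
        scores.append(score_fn(model,
                               [data[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return sum(scores) / len(scores)

# Toy stand-in model: always predict the majority training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)

data = list(range(20))
labels = [0] * 14 + [1] * 6   # imbalanced labels, as in defect data
est = out_of_sample_bootstrap(data, labels, train_majority, accuracy)
```

Because every repetition tests on rows the model never saw during training, the averaged score is an estimate of future performance rather than of fit to the training data, which is the property the paper's bias/variance comparison evaluates.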
Pages: 1-18