Researcher Bias: The Use of Machine Learning in Software Defect Prediction

Cited by: 251
Authors
Shepperd, Martin [1 ]
Bowes, David [2 ]
Hall, Tracy [1 ]
Affiliations
[1] Brunel Univ, Uxbridge UB8 3PH, Middx, England
[2] Univ Hertfordshire, Sci & Technol Res Inst, Hatfield AL10 9AB, Herts, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Software defect prediction; meta-analysis; researcher bias; fault-proneness; empirical analysis; classification; metrics; models; ANOVA; quality
DOI
10.1109/TSE.2014.2322358
Chinese Library Classification
TP31 [Computer Software]
Discipline Classification Code
081202; 0835
Abstract
Background. The ability to predict defect-prone software components would be valuable. Consequently, there have been many empirical studies evaluating the performance of different techniques that endeavour to accomplish this effectively. However, no one technique dominates, and so designing a reliable defect prediction model remains problematic. Objective. We seek to make sense of the many conflicting experimental results and understand which factors have the largest effect on predictive performance. Method. We conduct a meta-analysis of all relevant, high-quality primary studies of defect prediction to determine what factors influence predictive performance. This is based on 42 primary studies that satisfy our inclusion criteria and collectively report 600 sets of empirical prediction results. By reverse engineering a common response variable, we build a random effects ANOVA model to examine the relative contribution of four model building factors (classifier, data set, input metrics and researcher group) to model prediction performance. Results. Surprisingly, we find that the choice of classifier has little impact upon performance (1.3 percent), and in contrast the major (31 percent) explanatory factor is the researcher group. It matters more who does the work than what is done. Conclusion. To overcome this high level of researcher bias, defect prediction researchers should (i) conduct blind analysis, (ii) improve reporting protocols and (iii) conduct more intergroup studies in order to alleviate expertise issues. Lastly, research is required to determine whether this bias is prevalent in other application domains.
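The method described above rests on two steps: reverse-engineering a common response variable (the study uses the Matthews correlation coefficient, which can be recovered from a reconstructed confusion matrix) and then partitioning the variance of that response across model-building factors. The sketch below illustrates both steps on hypothetical toy data; the function names and values are illustrative, not taken from the study, and the simple fixed-effects eta-squared stands in for the paper's random effects ANOVA:

```python
from math import sqrt
from statistics import mean

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def eta_squared(groups):
    """Share of total variance explained by group membership
    (SS_between / SS_total) - a fixed-effects analogue of the
    variance partitioning the paper performs with a random
    effects ANOVA model."""
    values = [v for vs in groups.values() for v in vs]
    grand = mean(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_between = sum(len(vs) * (mean(vs) - grand) ** 2
                     for vs in groups.values())
    return ss_between / ss_total

# Step 1: a common response variable from reconstructed counts.
print(mcc(tp=50, fp=0, tn=50, fn=0))              # 1.0 (perfect)
print(round(mcc(tp=30, fp=20, tn=30, fn=20), 2))  # 0.2 (weak)

# Step 2: hypothetical MCC results grouped by researcher group;
# a value near 1 means the grouping factor explains most variance.
by_group = {"groupA": [0.80, 0.82], "groupB": [0.30, 0.28]}
print(round(eta_squared(by_group), 3))
```

The same `eta_squared` call can be repeated with results grouped by classifier, data set, or input metrics to compare the relative contribution of each factor, which is the comparison the abstract summarises as 1.3 percent (classifier) versus 31 percent (researcher group).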
Pages: 603-616
Page count: 14