The impact of using biased performance metrics on software defect prediction research

Cited by: 45
Authors
Yao, Jingxiu [1 ]
Shepperd, Martin [2 ]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Brunel Univ London, London, England
Keywords
Software engineering; Machine learning; Software defect prediction; Computational experiment; Classification metrics; Classification; Reviews
DOI
10.1016/j.infsof.2021.106664
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Context: Software engineering researchers have undertaken many experiments investigating the potential of software defect prediction algorithms. Unfortunately, some widely used performance metrics are known to be problematic, most notably F1, yet it remains in widespread use.
Objective: To investigate the potential impact of using F1 on the validity of this large body of research.
Method: We undertook a systematic review to locate relevant experiments and then extracted all pairwise comparisons of defect prediction performance using F1 and the unbiased Matthews correlation coefficient (MCC).
Results: We found a total of 38 primary studies containing 12,471 pairs of results. Of these comparisons, 21.95% changed direction when MCC was used instead of the biased F1 metric. Unfortunately, we also found evidence suggesting that F1 remains widely used in software defect prediction research.
Conclusion: We reiterate the concerns of statisticians that F1 is a problematic metric outside of an information retrieval context, since we are concerned with both classes (defect-prone and not defect-prone units). This inappropriate usage has led to a substantial number (more than one fifth) of results that are erroneous in terms of direction. We therefore urge researchers to (i) use an unbiased metric and (ii) publish detailed results, including confusion matrices, so that alternative analyses become possible.
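The direction change described in the abstract can be illustrated with a small worked example. The sketch below is hypothetical: the confusion-matrix counts and the two classifiers are invented for illustration and are not taken from the paper. It computes F1 and MCC for two defect predictors evaluated on the same data set and shows the preferred predictor flipping between the two metrics.

# Illustrative sketch (hypothetical counts, not data from the paper): how the
# "winner" of a pairwise comparison can flip between F1 and MCC.
import math

def f1_score(tp, fp, fn, tn):
    # F1 ignores true negatives entirely.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def mcc_score(tp, fp, fn, tn):
    # MCC uses all four cells of the confusion matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Two hypothetical defect predictors on the same 1,000-module data set
# (700 defect-prone modules, 300 clean modules).
clf_a = dict(tp=690, fp=280, fn=10, tn=20)    # labels almost everything defect-prone
clf_b = dict(tp=500, fp=30, fn=200, tn=270)   # more conservative predictions

for name, metric in (("F1 ", f1_score), ("MCC", mcc_score)):
    a, b = metric(**clf_a), metric(**clf_b)
    print(f"{name}: A={a:.3f}  B={b:.3f}  ->  {'A' if a > b else 'B'} looks better")

# Output: F1 prefers classifier A (~0.83 vs ~0.81), whereas MCC prefers
# classifier B (~0.14 vs ~0.56), i.e. the comparison changes direction.

Because F1 ignores true negatives, a predictor that labels nearly every module as defect-prone can look best under F1, whereas MCC, which accounts for all four cells of the confusion matrix, favours the more balanced predictor.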
Pages: 14