Data quality issues in software fault prediction: a systematic literature review

Cited by: 15
Authors
Bhandari, Kirti [1]
Kumar, Kuldeep [1]
Sangal, Amrit Lal [1]
Affiliations
[1] Dr BR Ambedkar Natl Inst Technol, Dept Comp Sci & Engn, Jalandhar 144011, Punjab, India
Keywords
Software fault prediction; Systematic literature review; Systematic mapping; Data quality issues; MACHINE LEARNING TECHNIQUES; CLASS IMBALANCE PROBLEM; DEFECT PREDICTION; FEATURE-SELECTION; OPTIMIZATION ALGORITHM; ATTRIBUTE SELECTION; CLASS OVERLAP; ENSEMBLE; MODEL; FRAMEWORK;
DOI
10.1007/s10462-022-10371-6
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Software fault prediction (SFP) aims to improve software quality at the minimum possible cost and time. Various machine learning models have been proposed for predicting software faults. The performance of these models depends on dataset quality and can be enhanced by identifying and eliminating data quality issues. In this paper, we present a systematic literature review of data quality issues in SFP datasets. We selected 145 primary studies published until November 2021 and analyzed them from five perspectives: data quality issue, pre-processing technique, modeling technique, dataset, and performance measures used. The findings indicate that data dimensionality, class imbalance, and their combination have been heavily considered in the literature. However, data quality issues such as class overlap and missing data are also pertinent to SFP datasets and need further investigation. The effect of resolving one data quality issue relative to others remains unexplored. C4.5, naive Bayes, multilayer perceptron, support vector machine, and random forest are the classifiers most frequently used by researchers. However, researchers should know how sensitive these classifiers are to a particular data quality issue and select them accordingly. The PROMISE datasets have been used extensively in SFP. Accuracy, precision, recall, and area under the curve are the common performance measures. It is suggested to employ unbiased and stable performance measures such as the Matthews correlation coefficient for model evaluation. Our findings indicate that data quality issues in SFP datasets degrade classifier performance and that there is scope for further research on data quality issues.
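The abstract's recommendation of the Matthews correlation coefficient over accuracy can be illustrated with a small sketch. This example is not from the paper: the 100-module dataset, the 10% fault rate, and the trivial "always clean" classifier are hypothetical, chosen only to show how class imbalance can inflate accuracy while the Matthews correlation coefficient stays at zero.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = faulty, 0 = clean)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of modules classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical imbalanced dataset: 10 faulty modules, 90 clean modules.
y_true = [1] * 10 + [0] * 90
# Hypothetical trivial classifier that labels every module as clean.
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)
print(f"accuracy = {acc_value:.2f}, MCC = {mcc_value:.2f}")
```

Under these assumptions the classifier finds no faulty module at all, yet its accuracy is high because the clean class dominates; the Matthews correlation coefficient uses all four confusion-matrix cells and reports no correlation with the true labels.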
Pages: 7839-7908
Page count: 70