Reliable prediction of software defects using Shapley interpretable machine learning models

Cited by: 7

Authors:

Al-Smadi, Yazan [1]

Eshtay, Mohammed [2]

Al-Qerem, Ahmad [1]

Nashwan, Shadi [3]

Ouda, Osama [3]

Abd El-Aziz, A. A. [4,5]

Affiliations:

[1] Zarqa Univ, Fac Informat Technol, Dept Comp Sci, Zarqa 13110, Jordan

[2] Luminus Tech Univ Coll, Amman 11118, Jordan

[3] Jouf Univ, Comp & Informat Sci Coll, Comp Sci Dept, Sakaka 72388, Saudi Arabia

[4] Jouf Univ, Comp & Informat Sci Coll, Informat Syst Dept, Sakaka 72388, Saudi Arabia

[5] Cairo Univ, Fac Grad Studies Stat Res, Informat Syst & Technol Dept, Giza, Egypt

Source:

EGYPTIAN INFORMATICS JOURNAL | 2023 / Vol. 24 / Issue 03

Keywords:

Software Defect Prediction; Feature importance; Machine learning; Model interpretation; Shapley Additive Explanation;

DOI:

10.1016/j.eij.2023.05.011

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Predicting defect-prone software components can play a significant role in allocating testing resources to fault-prone modules and hence in increasing the business value of software projects. Most current software defect prediction studies use traditional supervised machine learning algorithms to predict defects in software applications. The software datasets used in such studies are imbalanced, and therefore the reported results cannot be reliably used to judge their performance. Moreover, it is important to explain the output of the machine learning models employed in fault-prediction techniques in order to determine the contribution of each feature to the model output. In this paper, we propose a new framework for predicting software defects that uses eleven machine learning classifiers over twelve different datasets. For feature selection, we employ four nature-inspired search algorithms, namely particle swarm optimization, genetic algorithm, harmony algorithm, and ant colony optimization. We also use the synthetic minority oversampling technique (SMOTE) to address the problem of data imbalance. Furthermore, we use the Shapley additive explanation model to highlight the most determinative features. The obtained results demonstrate that gradient boosting, stochastic gradient boosting, decision trees, and categorical boosting outperform the other tested models, with over 90% accuracy and ROC-AUC. Additionally, we found that the ant colony optimization technique outperforms the other tested feature selection techniques.
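
As an illustration of the oversampling step named in the abstract, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation: the function name `smote`, the toy feature values, and the choices of `k` and `seed` are assumptions made purely for illustration. Each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: (lines_of_code, cyclomatic_complexity) per defective module.
defective = [(120.0, 14.0), (150.0, 18.0), (130.0, 16.0), (160.0, 20.0)]
clean_count = 10  # assumed size of the majority (non-defective) class

# Oversample the minority class until both classes have the same size.
new_rows = smote(defective, n_new=clean_count - len(defective))
balanced_defective = defective + new_rows
```

Because every synthetic row lies between two existing minority samples, the oversampled class stays inside the region the real defective modules already occupy. In practice, oversampling of this kind is normally applied to the training split only, so that evaluation data remain untouched.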


Pages: 20
