Reliable prediction of software defects using Shapley interpretable machine learning models

Cited by: 7
Authors
Al-Smadi, Yazan [1]
Eshtay, Mohammed [2]
Al-Qerem, Ahmad [1]
Nashwan, Shadi [3]
Ouda, Osama [3]
Abd El-Aziz, A. A. [4, 5]
Affiliations
[1] Zarqa Univ, Fac Informat Technol, Dept Comp Sci, Zarqa 13110, Jordan
[2] Luminus Tech Univ Coll, Amman 11118, Jordan
[3] Jouf Univ, Comp & Informat Sci Coll, Comp Sci Dept, Sakaka 72388, Saudi Arabia
[4] Jouf Univ, Comp & Informat Sci Coll, Informat Syst Dept, Sakaka 72388, Saudi Arabia
[5] Cairo UNI, Fac Grad Studies Stat Res, Informat Syst & Technol Dept, Giza, Egypt
Keywords
Software Defect Prediction; Feature importance; Machine learning; Model interpretation; Shapley Additive Explanation;
DOI
10.1016/j.eij.2023.05.011
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Predicting defect-prone software components can play a significant role in allocating relevant testing resources to fault-prone modules and hence increasing the business value of software projects. Most current software defect prediction studies use traditional supervised machine learning algorithms to predict defects in software applications. The software datasets used in such studies are imbalanced, and therefore the reported results cannot be reliably used to judge the performance of these models. Moreover, it is important to explain the output of machine learning models employed in fault-prediction techniques to determine the contribution of each utilized feature to the model output. In this paper, we propose a new framework for predicting software defects utilizing eleven machine learning classifiers over twelve different datasets. For feature selection, we employ four different nature-inspired search algorithms, namely, particle swarm optimization, genetic algorithm, harmony algorithm, and ant colony optimization. Moreover, we make use of the synthetic minority oversampling technique (SMOTE) to address the problem of data imbalance. Furthermore, we utilize the Shapley additive explanation model for highlighting the most determinative features. The obtained results demonstrate that gradient boosting, stochastic gradient boosting, decision trees, and categorical boosting outperform the other tested models, with accuracy and ROC-AUC above 90%. Additionally, we found that the ant colony optimization technique outperforms the other tested feature extraction techniques.
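The abstract's point about determining the contribution of each feature to the model output can be illustrated with a small exact Shapley-value computation. This is a minimal sketch and not the paper's pipeline: the function `shapley_values`, the `toy_model` scoring function, its three made-up code metrics, and the all-zero baseline are all hypothetical and chosen only for illustration. Replacing absent features with a baseline value is one common assumption for "removing" a feature; SHAP-style tools estimate the same quantities more efficiently for real models.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for a single input x.

    Assumption: a feature that is "absent" from a coalition is replaced
    by its baseline value. Cost grows exponentially with len(x), so this
    is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        # Model output when only the coalition's features come from x.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight shared by every coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical "defect risk" score over three made-up code metrics:
# lines of code, cyclomatic complexity, and recent churn.
def toy_model(features):
    loc, complexity, churn = features
    return 0.002 * loc + 0.05 * complexity + 0.01 * loc * churn / 100


x = [400, 12, 5]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the contributions sum to the difference between the score at `x` and the score at the baseline, and the `loc * churn` interaction term is split evenly between the two features involved, which is the additive behaviour that makes Shapley-based explanations easy to read.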
Pages: 20