Diversity based imbalance learning approach for software fault prediction using machine learning models

Cited by: 26
Authors
Manchala, Pravali [1]
Bisi, Manjubala [1]
Affiliations
[1] Natl Inst Technol Warangal, Dept Comp Sci & Engn, Hanamkonda, Telangana, India
Keywords
Imbalance learning; Software fault prediction; Oversampling; Machine learning model; Deep neural network; Support vector machine; Defect prediction; Sampling method; SMOTE; Classification
DOI
10.1016/j.asoc.2022.109069
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The goal of software fault prediction (SFP) is to distinguish between faulty and non-faulty modules. The performance of a prediction model is vulnerable to the class imbalance issue in SFP. Existing oversampling approaches generate near-identical synthetic samples, which leads to overgeneralization and low data diversity. Moreover, many undesirable noisy modules are introduced while generating synthetic data. In this study, we propose the Weighted Average Centroid based Imbalance Learning Approach (WACIL), an effective synthetic oversampling technique to mitigate the imbalance issue. WACIL first finds borderline instances, then generates pseudo-data from them through a weighted average centroid concept, and finally filters out inappropriate noisy data through a filtration process. We conducted experiments on 24 PROMISE and NASA projects and compared WACIL with several existing sampling approaches using K-Nearest Neighbors (KNN), Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT) and Deep Neural Network (DNN) as classification models. WACIL achieves superior results in terms of Fall Out Rate (FOR), F-measure and Area Under Curve (AUC), and obtains comparable results in terms of Recall and G-mean compared to the competitive approaches. Statistical analysis with the Wilcoxon signed-rank test and the matched-pairs rank-biserial correlation effect size indicates that WACIL's improvement over the other oversampling techniques is significant. Hence, WACIL is advisable as a competent choice to deal with the imbalance issue in SFP. (c) 2022 Elsevier B.V. All rights reserved.
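The three stages named in the abstract (borderline detection, weighted average centroid generation, filtration) can be illustrated with a minimal sketch. This is not the authors' WACIL implementation: the abstract does not give the exact rules, so every concrete choice here is an assumption. The borderline rule is borrowed from Borderline-SMOTE, the weights are random and normalised, the filter keeps a candidate only when its nearest original module is faulty, and the names `k_nearest` and `oversample_sketch` are hypothetical.

```python
import math
import random


def k_nearest(point, data, k, skip_index=None):
    """Return indices of the k rows of `data` closest to `point` (Euclidean)."""
    order = sorted(
        (i for i in range(len(data)) if i != skip_index),
        key=lambda i: math.dist(point, data[i]),
    )
    return order[:k]


def oversample_sketch(X, y, k=5, minority=1, seed=0):
    """Illustrative three-stage oversampler; all rules are assumptions."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(min_idx)) - len(min_idx)  # balance the classes

    # Stage 1 (assumed rule, as in Borderline-SMOTE): a minority row is
    # borderline when at least half, but not all, of its k neighbours
    # belong to the majority class.
    borderline = []
    for i in min_idx:
        nbrs = k_nearest(X[i], X, k, skip_index=i)
        n_major = sum(1 for j in nbrs if y[j] != minority)
        if k / 2 <= n_major < k:
            borderline.append(i)

    synthetic, attempts = [], 0
    max_attempts = 20 * max(n_needed, 1)  # guard against endless rejection
    while borderline and len(synthetic) < n_needed and attempts < max_attempts:
        attempts += 1
        i = rng.choice(borderline)
        others = [j for j in min_idx if j != i]
        near = k_nearest(X[i], [X[j] for j in others], min(k, len(others)))
        group = [X[i]] + [X[others[p]] for p in near]

        # Stage 2 (assumed weighting): centroid of the seed and its minority
        # neighbours under random positive weights normalised to sum to 1.
        weights = [rng.random() + 1e-9 for _ in group]
        total = sum(weights)
        candidate = [
            sum(w * row[d] for w, row in zip(weights, group)) / total
            for d in range(len(X[i]))
        ]

        # Stage 3 (assumed filter): drop the candidate as noise when its
        # nearest original row is a majority (non-faulty) module.
        if y[k_nearest(candidate, X, 1)[0]] == minority:
            synthetic.append(candidate)
    return synthetic


# Toy data: 8 non-faulty modules (label 0) and 3 faulty modules (label 1).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1], [3, 0], [3, 1],
     [4, 0], [4, 1], [6, 0.5]]
y = [0] * 8 + [1] * 3
new_rows = oversample_sketch(X, y, k=3)
```

Because the weights differ on every draw, the synthetic rows spread across the region spanned by the minority neighbourhood instead of lying on a single line segment between two points, which is one plausible reading of the diversity goal the abstract describes.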
Pages: 17