Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction

Cited by: 79
Authors
Goyal, Somya [1]
Affiliations
[1] Manipal Univ Jaipur, Jaipur 303007, Rajasthan, India
Keywords
Defect prediction; Class imbalance; Undersampling; Artificial Neural Networks (ANN); ROC; AUC; Overlap
DOI
10.1007/s10462-021-10044-w
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Software Defect Prediction (SDP) is a crucial task in the software development process: it forecasts which modules are more prone to errors and faults before the testing phase begins. It aims to reduce the development cost of the software by focusing testing effort on the modules predicted to be faulty. Although SDP helps ensure timely delivery of a good-quality end product, class imbalance in the dataset is a major hindrance to it. This paper proposes a novel Neighbourhood-based Under-Sampling (N-US) algorithm to handle the class-imbalance issue and demonstrates its effectiveness in attaining high accuracy when predicting defective modules. N-US under-samples the dataset to maximize the visibility of minority data points while restricting excessive elimination of majority data points, so as to avoid information loss. To assess the applicability of N-US, it is compared with three standard under-sampling techniques. Further, this study investigates the performance of N-US as a supporting technique for SDP classifiers. Extensive experiments are conducted on five benchmark datasets from the NASA repository: CM1, JM1, KC1, KC2 and PC1. The proposed SDP classifier with N-US is compared statistically with baseline models to assess the effectiveness of the N-US algorithm for SDP. The proposed model outperforms the other candidate SDP models, with the highest AUC score (95.6%), the highest accuracy (96.9%) and the ROC curve closest to the top-left corner. Its prediction performance is statistically the best at the 95% confidence level.
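The abstract does not spell out the N-US procedure, so the following is only a rough sketch of the general idea it describes: thin out majority-class points that crowd the neighbourhood of minority-class points, while capping how many majority points may be removed. It is not the paper's algorithm. The function name `neighbourhood_undersample` and the parameters `k`, `max_removal` and `minority` are assumptions made for illustration.

```python
import math


def neighbourhood_undersample(X, y, k=3, max_removal=0.5, minority=1):
    """Illustrative neighbourhood-based under-sampling (NOT the paper's N-US).

    For every minority point, its k nearest majority points become removal
    candidates. At most `max_removal` (a fraction) of the majority class is
    removed, closest candidates first, to limit information loss.
    """
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]

    # Map each candidate majority index to its smallest distance
    # to any minority point whose neighbourhood it falls in.
    candidates = {}
    for i in minority_idx:
        nearest = sorted(majority_idx, key=lambda j: math.dist(X[i], X[j]))[:k]
        for j in nearest:
            d = math.dist(X[i], X[j])
            candidates[j] = min(d, candidates.get(j, math.inf))

    # Cap removals so the majority class is not thinned excessively.
    budget = int(max_removal * len(majority_idx))
    to_remove = set(sorted(candidates, key=candidates.get)[:budget])

    keep = [i for i in range(len(y)) if i not in to_remove]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): six non-defective modules (0), one defective (1).
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (6.0, 6.0), (0.05, 0.05)]
y = [0, 0, 0, 0, 0, 0, 1]

X_res, y_res = neighbourhood_undersample(X, y, k=2, max_removal=0.5)
X_cap, y_cap = neighbourhood_undersample(X, y, k=2, max_removal=0.2)
```

In this sketch, `k` controls how much of each minority point's neighbourhood is cleared, and `max_removal` is the guard against discarding too much of the majority class; a real study would tune both and evaluate the resulting classifier with AUC and accuracy, as the abstract reports.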
Pages: 2023-2064
Number of pages: 42