Handling class overlap and imbalance using overlap driven under-sampling with balanced random forest in software defect prediction

Cited by: 1
Authors
Dar, Abdul Waheed [1]
Farooq, Sheikh Umar [1]
Affiliations
[1] Univ Kashmir, Dept Comp Sci, North Campus, Srinagar, India
Keywords
Class imbalance problem; Machine learning; Software defect prediction; Over-sampling; Under-sampling; PERFORMANCE; MACHINE; SMOTE
DOI
10.1007/s11334-024-00571-4
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Various machine learning techniques have been used to build software defect prediction (SDP) models that identify defective software modules. However, class overlap and class imbalance in SDP datasets remain major challenges for these models. This study proposes a new SDP model that combines an overlap-based under-sampling framework with the balanced random forest classifier to improve the identification of defective software modules. First, duplicate instances are removed from the dataset to avoid over-fitting. Next, an overlap-based under-sampling technique removes overlapped majority-class (non-defective) instances from the training data, maximizing the presence of minority-class (defective) instances in the region where the two classes overlap. Finally, the balanced random forest, which combines random under-sampling with ensemble learning, is trained on the pre-processed training data to perform classification. The efficacy of the proposed SDP model is assessed by comparing its performance against nine state-of-the-art SDP models on 15 imbalanced software defect datasets. Experimental results and statistical analysis indicate that the proposed model achieves better predictive performance than the other tested models in terms of recall, G-mean, F-measure and AUC.
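The three data-preparation steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The function names, the toy data, the label convention (1 = defective minority) and the parameter `k` are all hypothetical. In particular, the abstract does not state the overlap criterion, so the k-nearest-neighbour rule below is an assumed stand-in. The last function shows only the class-balanced bootstrap that a balanced random forest is generally understood to draw for each tree; the tree training itself is omitted.

```python
import math
import random


def remove_duplicates(rows):
    """Drop exact duplicate (features, label) rows, keeping first occurrences."""
    seen = set()
    unique = []
    for features, label in rows:
        key = (tuple(features), label)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def remove_overlapped_majority(rows, k=3, minority=1):
    """Drop majority rows that have a minority row among their k nearest neighbours.

    Assumption: this neighbourhood rule stands in for the paper's
    overlap-based under-sampling, whose exact criterion the abstract omits.
    """
    kept = []
    for i, (x, y) in enumerate(rows):
        if y == minority:
            kept.append((x, y))  # minority rows are never removed
            continue
        others = [r for j, r in enumerate(rows) if j != i]
        others.sort(key=lambda r: math.dist(x, r[0]))
        if all(label != minority for _, label in others[:k]):
            kept.append((x, y))
    return kept


def balanced_bootstrap(rows, rng, minority=1):
    """Draw one class-balanced sample with replacement (one per tree in the forest)."""
    minority_rows = [r for r in rows if r[1] == minority]
    majority_rows = [r for r in rows if r[1] != minority]
    n = len(minority_rows)
    sample = [rng.choice(minority_rows) for _ in range(n)]
    sample += [rng.choice(majority_rows) for _ in range(n)]
    return sample


# Toy data (hypothetical): label 0 = non-defective, label 1 = defective.
toy_rows = [
    ((0.0, 0.0), 0), ((0.0, 0.0), 0),  # exact duplicate
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
    ((2.0, 2.0), 0),                   # majority row inside the minority region
    ((2.1, 2.1), 1), ((2.2, 1.9), 1), ((1.9, 2.2), 1),
]

deduped = remove_duplicates(toy_rows)
cleaned = remove_overlapped_majority(deduped, k=3)
sample = balanced_bootstrap(cleaned, random.Random(0))
print(len(toy_rows), len(deduped), len(cleaned), len(sample))
```

On this toy data the duplicate row and the single majority row lying among the minority points are removed, and the bootstrap contains as many majority draws as minority draws. In practice, the overlap step would be applied to the training split only, as the abstract specifies.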
Pages: 747-767
Number of pages: 21