An ensemble model for addressing class imbalance and class overlap in software defect prediction

Cited by: 1
Authors
Dar, Abdul Waheed [1]
Farooq, Sheikh Umar [1]
Affiliations
[1] Univ Kashmir, Dept Comp Sci, North Campus, Baramulla, India
Keywords
Class imbalance problem; Class overlap problem; Machine learning; Over-sampling; Under-sampling; Software Defect Prediction; PERFORMANCE; CLASSIFICATION; MACHINE; SMOTE
DOI
10.1007/s13198-024-02538-x
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
Software defect prediction (SDP) is an important activity, and an emerging challenge, in software development that is used to improve software quality. SDP identifies the modules of a software system that are expected to contain defects, which helps allocate limited testing resources cost-efficiently and so reduces the overall development cost. Various machine learning techniques have been used to develop SDP models. However, a major challenge for SDP models in identifying defective modules is the class imbalance problem of SDP datasets. Moreover, the existing literature shows that class overlap in imbalanced SDP datasets has a strongly negative impact on the prediction capability of SDP models. In this paper, we propose an effective ensemble SDP model that employs a four-stage pipeline to address the problems of class overlap and class imbalance simultaneously. Our approach integrates a class overlap reduction technique and an under-sampling technique with the extreme gradient boosting classifier (XGBoost). Through this integrated approach, our model handles both class overlap and class imbalance, providing an enhanced solution for SDP tasks. We assess the effectiveness of the proposed SDP model by comparing its performance against ten state-of-the-art SDP models on sixteen imbalanced software defect datasets. The experimental results, together with statistical analysis, indicate that the proposed SDP model achieves superior predictive performance, surpassing the ten benchmark models on recall, G-mean, F-measure, and AUC.
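The threshold-based metrics named in the abstract (recall, G-mean, F-measure) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `imbalance_metrics`, the toy labels, and the convention that 1 marks a defective module are assumptions made for illustration. AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def imbalance_metrics(y_true, y_pred):
    """Return accuracy, recall, G-mean and F-measure for binary labels.

    Assumption for this sketch: label 1 = defective (minority class),
    label 0 = non-defective (majority class).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # G-mean balances the hit rate on both classes.
    g_mean = math.sqrt(recall * specificity)
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


# Toy imbalanced data (hypothetical): 2 defective modules out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_clean = [0] * 10                       # predicts "no defect" everywhere
some_signal = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # finds one of the two defects

trivial = imbalance_metrics(y_true, always_clean)
useful = imbalance_metrics(y_true, some_signal)
print(trivial)
print(useful)
```

On this toy data both predictors reach the same accuracy, but the one that never flags a defect has zero recall, G-mean, and F-measure. That gap is why imbalanced SDP studies typically report these metrics rather than accuracy alone.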
Pages: 5584-5603
Page count: 20