Improving the undersampling technique by optimizing the termination condition for software defect prediction

Cited by: 25
Authors
Feng, Shuo [1]
Keung, Jacky [2]
Xiao, Yan [3,4]
Zhang, Peichang [5]
Yu, Xiao [6]
Cao, Xiaochun [3]
Affiliations
[1] Zhengzhou Univ, Sch Comp & Artificial Intelligence, Zhengzhou, Peoples R China
[2] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
[3] Sun Yat Sen Univ, Sch Cyber Sci & Technol, Shenzhen, Peoples R China
[4] Natl Univ Singapore, Sch Comp, Singapore, Singapore
[5] Shenzhen Univ, Coll Elect & Informat Engn, Shenzhen, Peoples R China
[6] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan, Peoples R China
Keywords
Software defect prediction; Class imbalance; Learning-to-rank; Undersampling; Oversampling; Data resampling; DIFFERENTIAL EVOLUTION; EFFECT SIZE; SMOTE; QUALITY; METRICS; FAULTS;
DOI
10.1016/j.eswa.2023.121084
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The class imbalance problem significantly hinders the ability of software defect prediction (SDP) models to distinguish between defective (minority class) and non-defective (majority class) software instances. Recent studies on data resampling techniques have shown that Random UnderSampling (RUS) is more effective than several complex oversampling techniques at alleviating this problem. However, RUS removes majority class instances blindly, which leads to significant information loss. These studies have also pointed out that the conventional termination condition of data resampling (i.e., stopping when the minority and majority classes contain the same number of instances) can result in suboptimal performance. The undersampling technique can in fact be likened to a recommender system or a web search engine that recommends majority class instances to SDP models. We therefore propose the Learning-To-Rank Undersampling technique (LTRUS). Our work is novel in two aspects: (1) We treat the undersampling process as a learning-to-rank task, optimizing a linear model to rank majority class instances and removing instances from the bottom of the ranking to alleviate the class imbalance problem. (2) We propose two termination conditions for the undersampling technique that differ from the conventional termination condition. Under the conventional termination condition, LTRUS significantly outperforms RUS, the clustering-based undersampling technique, the complexity-based oversampling technique, SMOTUNED, and Borderline-SMOTE in terms of F-measure, AUC, and MCC by 8.9%, 7.6%, and 18.0% on average. Furthermore, LTRUS yields similar performance under the two termination conditions we propose, and under either of them it outperforms both LTRUS and all the other baselines under the conventional termination condition. The experimental results demonstrate the effectiveness of LTRUS and indicate that the conventional termination condition for the data resampling technique is improper.
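The ranking-based removal described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `rank_based_undersample`, the pre-fitted `weights` vector, and the `target_ratio` parameter are all hypothetical, and the abstract does not specify how the linear model is optimized or what the two proposed termination conditions are. Here `target_ratio=1.0` corresponds to the conventional termination condition (equal class sizes); other values only show that the stopping point can be varied.

```python
def rank_based_undersample(X, y, weights, target_ratio=1.0):
    """Keep the top-ranked majority rows (label 0) and all minority rows (label 1).

    X            : list of feature vectors (lists of floats)
    y            : list of labels, 0 = non-defective, 1 = defective
    weights      : hypothetical coefficients of an already-fitted linear ranker
    target_ratio : hypothetical stopping parameter; undersampling stops once
                   n_majority <= target_ratio * n_minority (1.0 = equal sizes)
    """
    majority = [i for i, label in enumerate(y) if label == 0]
    minority = [i for i, label in enumerate(y) if label == 1]

    # Number of majority instances that survive, per the stopping rule.
    n_keep = min(len(majority), round(target_ratio * len(minority)))

    # Linear score: dot product of the weight vector and the feature vector.
    def score(i):
        return sum(w * v for w, v in zip(weights, X[i]))

    # Rank majority instances from highest to lowest score, then drop the
    # lowest-ranked ones (the "bottom of the rank").
    ranked = sorted(majority, key=score, reverse=True)
    keep = sorted(minority + ranked[:n_keep])

    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: four non-defective and two defective instances, one feature each.
X = [[0.1], [0.9], [0.5], [0.3], [1.0], [0.2]]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = rank_based_undersample(X, y, weights=[1.0])
```

With the toy data above, the two lowest-scoring majority instances are removed, so the resampled set has two instances per class. Unlike RUS, the choice of which majority instances to discard depends on the ranking rather than on chance.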
Pages: 13