Class Imbalance Issue in Software Defect Prediction Models by various Machine Learning Techniques: An Empirical Study

Cited by: 8
Authors
Pandey, Sushant Kumar [1]
Tripathi, Anil Kumar [1]
Affiliations
[1] Banaras Hindu Univ, Indian Inst Technol, Dept Comp Sci & Engn, Varanasi, Uttar Pradesh, India
Source
2021 8TH INTERNATIONAL CONFERENCE ON SMART COMPUTING AND COMMUNICATIONS (ICSCC) | 2021
Keywords
Software fault prediction; Class imbalance; Machine learning; Software metrics; Statistical methods;
DOI
10.1109/ICSCC51209.2021.9528170
Chinese Library Classification number
TP301 [Theory, methods];
Subject classification code
081202;
Abstract
Software practitioners continue to build advanced software defect prediction (SDP) models to help testers find fault-prone modules. However, the class imbalance (CI) problem, in which datasets contain very few defective instances and many more non-defective ones, causes inconsistent model performance. We conducted 880 experiments to analyze how the performance of 10 SDP models varies under class imbalance. Our experiments used 22 public datasets comprising 41 software metrics, 10 baseline SDP methods, and 4 sampling techniques. We used the Matthews Correlation Coefficient (MCC), which is more useful when a dataset is highly imbalanced. We also compared the predictive performance of various ML models after applying the 4 sampling techniques. To examine the performance of the different SDP models, we used the F-measure. We found that the performance of the learning models is unsatisfactory and that this needs to be mitigated. We also found a few surprising results and some logical patterns between classifier and sampling technique. The study provides a connection between sampling technique, software metrics, and classifier.
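The abstract names MCC and the F-measure without defining them. The following is a minimal sketch of both, using the standard confusion-matrix definitions; it is not the authors' code, and the module counts are hypothetical values invented for illustration. It shows why accuracy alone can mislead on an imbalanced defect dataset.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (F1): harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient computed from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical dataset: 950 non-defective and 50 defective modules.
# Classifier A labels every module as non-defective.
a = {"tp": 0, "tn": 950, "fp": 0, "fn": 50}
# Classifier B finds 30 of the 50 defective modules, with 40 false alarms.
b = {"tp": 30, "tn": 910, "fp": 40, "fn": 20}

for name, c in (("A", a), ("B", b)):
    accuracy = (c["tp"] + c["tn"]) / sum(c.values())
    print(
        f"{name}: accuracy={accuracy:.3f} "
        f"F-measure={f_measure(c['tp'], c['fp'], c['fn']):.3f} "
        f"MCC={mcc(**c):.3f}"
    )
```

In this made-up example, classifier A reaches higher accuracy than classifier B even though it detects no defective module at all, while both MCC and the F-measure score it at zero.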
Pages: 58-63
Number of pages: 6