Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data

Cited by: 8

Authors:

Benala, Tirimula Rao [1]

Tantati, Karunya [1]

Affiliations:

[1] Jawaharlal Nehru Technol Univ, JNTU GV Coll Engn, Dept Informat Technol, Gurajada Vizianagaram 535003, Andhra Pradesh, India

Source:

INNOVATIONS IN SYSTEMS AND SOFTWARE ENGINEERING | 2023 / Vol. 19 / Issue 03

Keywords:

Software defect prediction; Machine learning classifiers; Oversampling methods; Imbalanced datasets; SMOTE;

DOI:

10.1007/s11334-022-00457-3

Chinese Library Classification number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Software defect prediction (SDP) is essential to analyze and identify defects present in a software model in early stages of software development. The identification of these defects and their early removal provides cost-efficient software. Machine learning (ML) techniques have been successfully used for developing defect prediction models. However, these techniques deliver off-target results when implemented on imbalanced datasets. For example, a dataset with unequal class distribution is technically imbalanced. Thus, ML techniques on such imbalanced data lead to a biased prediction of minority class instances, which are more important than majority class instances. Therefore, the imbalanced data problem must be resolved to successfully develop an efficient SDP model. In this study, we evaluated the prediction capability of ML classifiers for software defect prediction on nine imbalanced NASA datasets by applying oversampling methods. In addition, we considered five oversampling methods to synthesize minority class instances and make the datasets balanced. Dataset imbalance was eliminated using the five oversampling techniques. The oversampling techniques replicated or synthesized the instances of minority classes to balance the datasets. When the datasets were balanced, the ML classifiers were used to develop a defect prediction model. The experimental results acquired by applying ML classifiers on the imbalanced and balanced data showed an enhancement in the learning capability of ML techniques with the implementation of sampling techniques. Oversampling methods considerably improved the prediction performance of the ML classifiers.
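The abstract says the oversampling techniques "replicated or synthesized" minority class instances until the datasets were balanced. The following is a minimal sketch of those two ideas on toy binary data, not the authors' implementation: the abstract does not name the five methods (SMOTE appears only in the keywords), and the function names, the toy feature values, and the choice of k are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(X, y, minority_label, rng):
    """Replicate randomly chosen minority rows until both classes are equal in size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    # Binary labels assumed: majority count minus minority count
    deficit = len(y) - 2 * len(minority)
    extra = [list(rng.choice(minority)) for _ in range(max(deficit, 0))]
    return X + extra, y + [minority_label] * len(extra)


def smote_like_oversample(X, y, minority_label, rng, k=3):
    """Synthesize minority rows by interpolating toward a nearby minority neighbor."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    deficit = len(y) - 2 * len(minority)
    synthetic = []
    for _ in range(max(deficit, 0)):
        base = rng.choice(minority)
        # Up to k nearest minority neighbors of `base` (squared Euclidean distance)
        neighbors = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        target = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to target
        synthetic.append([a + gap * (b - a) for a, b in zip(base, target)])
    return X + synthetic, y + [minority_label] * len(synthetic)


# Toy data (hypothetical): label 1 = defective (minority), label 0 = clean (majority)
X = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2],
     [5.0, 6.0], [5.1, 6.2], [4.8, 5.9], [5.3, 6.1], [5.0, 5.8], [4.9, 6.3]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0]

rng = random.Random(0)
X_rep, y_rep = random_oversample(X, y, minority_label=1, rng=rng)
X_syn, y_syn = smote_like_oversample(X, y, minority_label=1, rng=rng)
print(Counter(y), Counter(y_rep), Counter(y_syn))
```

In both variants only the training data would normally be resampled, so that the evaluation data keeps its original class distribution. A classifier is then fitted on the balanced set, as the abstract describes.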


Pages: 247-263

Page count: 17

References (42 in total):

[1] Abu Shanab A, 2012, 2012 IEEE 13TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI), P415, DOI 10.1109/IRI.2012.6303039

[2] Is "Better Data" Better Than "Better Data Miners"? On the Benefits of Tuning SMOTE for Defect Prediction [J].

Agrawal, Amritanshu ;

Menzies, Tim .

PROCEEDINGS 2018 IEEE/ACM 40TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), 2018, :1050-1061

[3] MAHAKIL: Diversity Based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction [J].

Bennin, Kwabena Ebo ;

Keung, Jacky ;

Phannachitta, Passakorn ;

Monden, Akito ;

Mensah, Solomon .

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2018, 44 (06) :534-550

[4] On the relative value of data resampling approaches for software defect prediction [J].

Bennin, Kwabena Ebo ;

Keung, Jacky W. ;

Monden, Akito .

EMPIRICAL SOFTWARE ENGINEERING, 2019, 24 (02) :602-636

[5] Breiman L, 1996, MACH LEARN, V24, P123, DOI 10.1007/BF00058655

[6] A systematic study of the class imbalance problem in convolutional neural networks [J].

Buda, Mateusz ;

Maki, Atsuto ;

Mazurowski, Maciej A. .

NEURAL NETWORKS, 2018, 106 :249-259

[7] Bunkhumpornpat C, 2009, LECT NOTES ARTIF INT, V5476, P475, DOI 10.1007/978-3-642-01307-2_43

[8] SMOTE: Synthetic minority over-sampling technique [J].

Chawla, Nitesh V. ;

Bowyer, Kevin W. ;

Hall, Lawrence O. ;

Kegelmeyer, W. Philip .

JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2002, 16 :321-357

[9] Learning from imbalanced data in surveillance of nosocomial infection [J].

Cohen, Gilles ;

Hilario, Melanie ;

Sax, Hugo ;

Hugonnet, Stephane ;

Geissbuhler, Antoine .

ARTIFICIAL INTELLIGENCE IN MEDICINE, 2006, 37 (01) :7-18

[10] Drummond C., 2003, Proceedings of the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II, P1
