Cross project defect prediction using class distribution estimation and oversampling

Cited by: 44
Authors
Limsettho, Nachai [1]
Bennin, Kwabena Ebo [2]
Keung, Jacky W. [2]
Hata, Hideaki [1]
Matsumoto, Kenichi [1]
Affiliations
[1] Nara Inst Sci & Technol, Grad Sch Sci & Technol, Ikoma, Japan
[2] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
Keywords
Cross-Project defect prediction; Software fault prediction; Oversampling; Class imbalance learning; Class distribution estimation;
DOI
10.1016/j.infsof.2018.04.001
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Context: Cross-project defect prediction (CPDP), which uses datasets from other projects to build predictors, has recently been recommended as an effective approach for building prediction models for projects that lack historical or sufficient local datasets. Class imbalance and distribution mismatch between the source and target datasets, both associated with real-world defect datasets, are known to have a negative impact on prediction performance.
Objective: To alleviate the negative effects of class imbalance and distribution mismatch on the performance of CPDP models by combining Class Distribution Estimation (CDE) with the Synthetic Minority Oversampling Technique (SMOTE). A novel approach, CDE-SMOTE, is proposed to improve CPDP performance while avoiding excessive oversampling.
Method: CDE-SMOTE employs CDE to estimate the class distribution of the target project. SMOTE is then used to modify the class distribution of the training data until it becomes the reverse of the estimated class distribution of the target project. Four comprehensive experiments are conducted on 14 open source software projects.
Results: The proposed approach improves the overall performance of CPDP models compared with other CPDP approaches. According to Wilcoxon signed-rank tests, significant improvements are observed in 63% of the test cases, with improvements of 16.421%, 29.687% and 20.259% in Balance, G-measure and F-measure, respectively. Applying CDE-SMOTE to NN-filtered datasets also significantly improved prediction performance.
Conclusions: CDE-SMOTE mitigates the class imbalance and distribution mismatch problems and helps prevent the excessive oversampling that degrades the performance of prediction models. The approach is thus recommended for CPDP studies in software engineering.
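The oversampling step described under Method can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how CDE produces its estimate, so the estimated target defect ratio is simply passed in as a number, and "reverse" is read here as swapping the two class proportions (a target estimated at 20% defective leads to training data that is 80% defective). The function names `synthetic_count`, `smote` and `cde_smote` are hypothetical, and the interpolation is a simplified SMOTE written with the standard library only.

```python
import math
import random


def synthetic_count(n_defective, n_clean, est_target_defect_ratio):
    """Synthetic defective rows needed so the training defect ratio
    becomes 1 - est_target_defect_ratio (the assumed 'reversed' distribution)."""
    if not 0.0 < est_target_defect_ratio < 1.0:
        raise ValueError("estimated ratio must lie strictly between 0 and 1")
    desired = 1.0 - est_target_defect_ratio
    # Solve (n_defective + s) / (n_defective + s + n_clean) = desired for s.
    needed = desired * n_clean / est_target_defect_ratio - n_defective
    # Never remove rows: if the data is already past the goal, add nothing.
    return max(0, round(needed))


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between a minority row and one of its
    k nearest minority neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def cde_smote(train_x, train_y, est_target_defect_ratio, k=5, seed=0):
    """Oversample defective rows (label 1) up to the reversed target ratio.
    The CDE estimate itself is an input here (assumption)."""
    defective = [x for x, y in zip(train_x, train_y) if y == 1]
    n_clean = len(train_y) - len(defective)
    n_new = synthetic_count(len(defective), n_clean, est_target_defect_ratio)
    extra = smote(defective, n_new, k=k, seed=seed) if n_new else []
    return train_x + extra, train_y + [1] * len(extra)
```

For example, with 2 defective and 8 clean training rows and an estimated target defect ratio of 0.2, the sketch adds 30 synthetic defective rows, giving 32 defective out of 40 (80%). Bounding the oversampling by the estimate, rather than always oversampling to a fixed ratio, is what the abstract refers to as avoiding excessive oversampling.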
Pages: 87-102
Page count: 16