Two staged data preprocessing ensemble model for software fault prediction

Cited by: 4
Authors
Elahi, Ehsan [1 ]
Ayub, Amber [2 ]
Hussain, Irfan [3 ]
Affiliations
[1] COMSATS Univ Islamabad, Dept Comp Sci, Islamabad, Pakistan
[2] Air Univ, Dept Comp Sci, Multan Campus, Multan, Pakistan
[3] GIK Inst Engn Sci & Technol, Fac Comp Sci & Engn, Swabi, Pakistan
Source
PROCEEDINGS OF 2021 INTERNATIONAL BHURBAN CONFERENCE ON APPLIED SCIENCES AND TECHNOLOGIES (IBCAST) | 2021
Keywords
Software fault prediction; random oversampling; ensemble method; class overlapping; DEFECT PREDICTION; CLASSIFICATION; FRAMEWORK;
DOI
10.1109/IBCAST51254.2021.9393182
Chinese Library Classification
T [Industrial Technology];
Discipline classification code
08 ;
Abstract
Software fault prediction helps researchers and software testers identify faulty modules early in development. Early identification of faulty modules improves software quality and reduces cost. Imbalanced datasets hinder the performance of software fault prediction models: the model becomes biased towards the majority class and may fail to produce useful results. Moreover, class overlap in the data leads to incorrect predictions. The class overlap problem needs to be addressed because the available datasets are highly imbalanced and overlapped. Many fault prediction models based on machine learning classifiers have been proposed in the literature, but there is still room for improvement. The main objective of this study is to train the model on balanced, non-overlapping data and thereby improve its prediction capability. To this end, the dataset undergoes two-stage preprocessing before training. First, the class overlap problem is addressed using the neighborhood cleaning method; second, the data is balanced using random oversampling. Five publicly available datasets from the PROMISE repository are used. Four base learners are trained, and their results are combined using the model averaging method. To determine the usefulness of the proposed approach, the results are compared with those obtained using overlap cleaning only and resampling only, as well as with an existing approach for handling imbalanced data. Experiments show that the proposed technique achieves better prediction capability than these alternatives. The performance measure used for evaluation is the area under the curve (AUC). To reduce randomness and bias, the results are validated using k-fold cross-validation (k = 10).
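The two preprocessing stages and the averaging step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names `clean_overlap`, `random_oversample`, and `average_probabilities` are hypothetical, the cleaning step is a simplified nearest-neighbour rule standing in for the neighborhood cleaning method, and the toy data and probabilities are made up for illustration (the abstract does not name the four base learners).

```python
import math
import random
from collections import Counter


def clean_overlap(X, y, k=3):
    """Simplified neighbourhood-style cleaning (an assumption, not the paper's
    exact rule): drop majority-class samples whose k nearest neighbours
    mostly belong to the other class."""
    majority = Counter(y).most_common(1)[0][0]
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            keep.append(i)  # minority samples are never removed
            continue
        # Indices of the k nearest neighbours of sample i (excluding itself).
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        votes = Counter(y[j] for j in neighbours)
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def average_probabilities(per_model_probs):
    """Model averaging: mean fault probability across base learners."""
    return [sum(p) / len(p) for p in zip(*per_model_probs)]


# Toy data (made up): label 0 = non-faulty (majority), 1 = faulty (minority).
# The point (0.95, 1.0) is a majority sample sitting inside the minority region.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.1),
     (0.95, 1.0), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_clean, y_clean = clean_overlap(X, y)             # stage 1: remove overlap
X_bal, y_bal = random_oversample(X_clean, y_clean)  # stage 2: balance classes
print(Counter(y_clean), Counter(y_bal))
```

In this sketch the stages run in the order given in the abstract: overlap cleaning first, then oversampling. When combined with cross-validation, resampling would normally be applied to the training folds only, so that duplicated samples do not leak into the test fold.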
Pages: 506 - 511
Number of pages: 6
Related papers
23 in total
  • [11] Khuat, 2019, INT J ELECTR COMPUT, V9, P3241, DOI 10.11591/ijece.v9i4.pp3241-3246
  • [12] Exploiting the Essential Assumptions of Analogy-Based Effort Estimation
    Kocaguneli, Ekrem
    Menzies, Tim
    Bener, Ayse Basar
    Keung, Jacky W.
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2012, 38 (02) : 425 - 438
  • [13] Benchmarking classification models for software defect prediction: A proposed framework and novel findings
    Lessmann, Stefan
    Baesens, Bart
    Mues, Christophe
    Pietsch, Swantje
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2008, 34 (04) : 485 - 496
  • [14] Empirical Evaluation of the Impact of Class Overlap on Software Defect Prediction
    Gong, Lina
    Jiang, Shujuan
    Wang, Rongcun
    Jiang, Li
    [J]. 34TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE 2019), 2019, : 710 - 721
  • [15] Empirical Studies of a Two-Stage Data Preprocessing Approach for Software Fault Prediction
    Liu, Wangshu
    Liu, Shulong
    Gu, Qing
    Chen, Jiaqiang
    Chen, Xiang
    Chen, Daoxu
    [J]. IEEE TRANSACTIONS ON RELIABILITY, 2016, 65 (01) : 38 - 53
  • [16] An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics
    Lopez, Victoria
    Fernandez, Alberto
    Garcia, Salvador
    Palade, Vasile
    Herrera, Francisco
    [J]. INFORMATION SCIENCES, 2013, 250 : 113 - 141
  • [17] A systematic review of machine learning techniques for software fault prediction
    Malhotra, Ruchika
    [J]. APPLIED SOFT COMPUTING, 2015, 27 : 504 - 518
  • [18] Defect prediction from static code features: current results, limitations, new approaches
    Menzies, Tim
    Milton, Zach
    Turhan, Burak
    Cukic, Bojan
    Jiang, Yue
    Bener, Ayse
    [J]. AUTOMATED SOFTWARE ENGINEERING, 2010, 17 (04) : 375 - 407
  • [19] An Empirical Study on Software Defect Prediction Using Over-Sampling by SMOTE
    Pak, Cholmyong
    Wang, Tian Tian
    Su, Xiao Hong
    [J]. INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2018, 28 (06) : 811 - 830
  • [20] A Comprehensive Investigation of the Role of Imbalanced Learning for Software Defect Prediction
    Song, Qinbao
    Guo, Yuchen
    Shepperd, Martin
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2019, 45 (12) : 1253 - 1269