An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data

Cited by: 76
Authors
Malhotra, Ruchika [1 ]
Kamal, Shine [1 ]
Affiliation
[1] Delhi Technol Univ, Dept Comp Sci & Engn, Discipline Software Engn, Delhi, India
Keywords
Defect prediction; Imbalanced data; Oversampling methods; MetaCost learners; Machine learning techniques; Procedural metrics; SAMPLING APPROACH; NEURAL-NETWORKS; CLASSIFICATION; SMOTE; QUALITY;
DOI
10.1016/j.neucom.2018.04.090
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Software defect prediction is important for identifying defects in the early phases of the software development life cycle. This early identification, and thereby removal, of software defects is crucial to yielding a cost-effective, good-quality software product. Although previous studies have successfully used machine learning techniques for software defect prediction, these techniques yield biased results when applied to imbalanced data sets. An imbalanced data set has a non-uniform class distribution, with very few instances of one class compared to the other. Use of imbalanced data sets leads to off-target predictions of the minority class, which is generally considered more important than the majority class. Thus, handling imbalanced data effectively is crucial for the successful development of a competent defect prediction model. This study evaluates the effectiveness of machine learning classifiers for software defect prediction on twelve imbalanced NASA datasets by applying sampling methods and cost-sensitive classifiers. We investigate five existing oversampling methods, which replicate instances of the minority class, and also propose a new method, SPIDER3, by suggesting modifications to the SPIDER2 oversampling method. Furthermore, the work evaluates the performance of MetaCost learners for cost-sensitive learning on imbalanced datasets. The results show an improvement in the prediction capability of machine learning classifiers with the use of oversampling methods. Furthermore, the proposed SPIDER3 method shows promising results. (C) 2019 Elsevier B.V. All rights reserved.
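As a rough illustration of what "replicating instances of the minority class" means, here is a minimal sketch of plain random oversampling. It is not SPIDER2, SPIDER3, or any of the five methods evaluated in the paper, whose details the abstract does not give. The function name `random_oversample` and the toy metric values are hypothetical and chosen only for this example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (assumed baseline only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to replicate for this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: (lines of code, cyclomatic complexity) per module;
# label 1 marks a defective module, which is the minority class here.
rows = [(120, 4), (300, 9), (80, 2), (150, 5), (60, 1), (900, 30)]
labels = [0, 0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

In practice, resampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.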
Pages: 120 - 140
Page count: 21
Related papers
50 in total
  • [31] How Far Have We Progressed in the Sampling Methods for Imbalanced Data Classification? An Empirical Study
    Sun, Zhongbin
    Zhang, Jingqi
    Zhu, Xiaoyan
    Xu, Donghong
    ELECTRONICS, 2023, 12 (20)
  • [32] Improving Imbalanced Dataset Classification Using Oversampling and Gradient Boosting
    Cahyana, Nurheri
    Khomsah, Siti
    Aribowo, Agus Sasmito
    2019 5TH INTERNATIONAL CONFERENCE ON SCIENCE IN INFORMATION TECHNOLOGY (ICSITECH): EMBRACING INDUSTRY 4.0 - TOWARDS INNOVATION IN CYBER PHYSICAL SYSTEM, 2019, : 217 - 222
  • [33] Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction
    Feng, Shuo
    Keung, Jacky
    Yu, Xiao
    Xiao, Yan
    Zhang, Miao
    INFORMATION AND SOFTWARE TECHNOLOGY, 2021, 139
  • [34] A New Software Fault Prediction Model in Imbalanced Data
    Wang, Shi-Hai
    He, Ping
    2015 INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND INFORMATION SYSTEM (SEIS 2015), 2015, : 245 - 250
  • [35] Software Defect Prediction with Skewed Data
    Seliya, Naeem
    Khoshgoftaar, Taghi M.
    16TH ISSAT INTERNATIONAL CONFERENCE ON RELIABILITY AND QUALITY IN DESIGN, 2010, : 403 - +
  • [36] Prediction of Autism Spectrum Disorder Based on Imbalanced Resting-state fMRI Data Using Clustering Oversampling
    Yuan, Dan
    Zhu, Li
    Huang, Huifang
    TENTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING SYSTEMS, 2019, 2019, 11071
  • [37] An empirical study to investigate the impact of data resampling techniques on the performance of class maintainability prediction models
    Malhotra, Ruchika
    Lata, Kusum
    NEUROCOMPUTING, 2021, 459 : 432 - 453
  • [38] An Empirical Study on the Stability of Feature Selection for Imbalanced Software Engineering Data
    Wang, Huanjing
    Khoshgoftaar, Taghi M.
    Napolitano, Amri
    2012 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2012), VOL 1, 2012, : 317 - 323
  • [39] Support Vector based Oversampling Technique for Handling Class Imbalance in Software Defect Prediction
    Malhotra, Ruchika
    Agrawal, Vaibhav
    Pal, Vedansh
    Agarwal, Tushar
    2021 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, DATA SCIENCE & ENGINEERING (CONFLUENCE 2021), 2021, : 1078 - 1083
  • [40] Dealing with imbalanced data for interpretable defect prediction
    Gao, Yuxiang
    Zhu, Yi
    Zhao, Yu
    INFORMATION AND SOFTWARE TECHNOLOGY, 2022, 151