An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data

Cited by: 76
Authors
Malhotra, Ruchika [1]
Kamal, Shine [1]
Affiliation
[1] Delhi Technol Univ, Dept Comp Sci & Engn, Discipline Software Engn, Delhi, India
Keywords
Defect prediction; Imbalanced data; Oversampling methods; MetaCost learners; Machine learning techniques; Procedural metrics; SAMPLING APPROACH; NEURAL-NETWORKS; CLASSIFICATION; SMOTE; QUALITY
DOI
10.1016/j.neucom.2018.04.090
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Software defect prediction is important for identifying defects in the early phases of the software development life cycle. This early identification, and thereby removal, of software defects is crucial to yield a cost-effective, good-quality software product. Although previous studies have successfully used machine learning techniques for software defect prediction, these techniques yield biased results when applied to imbalanced data sets. An imbalanced data set has a non-uniform class distribution, with very few instances of one class compared to the other. Use of imbalanced data sets leads to off-target predictions of the minority class, which is generally considered to be more important than the majority class. Thus, handling imbalanced data effectively is crucial for the successful development of a competent defect prediction model. This study evaluates the effectiveness of machine learning classifiers for software defect prediction on twelve imbalanced NASA data sets by applying sampling methods and cost-sensitive classifiers. We investigate five existing oversampling methods, which replicate instances of the minority class, and also propose a new method, SPIDER3, by suggesting modifications to the SPIDER2 oversampling method. Furthermore, the work evaluates the performance of MetaCost learners for cost-sensitive learning on imbalanced data sets. The results show improvement in the prediction capability of machine learning classifiers with the use of oversampling methods. Furthermore, the proposed SPIDER3 method shows promising results. (C) 2019 Elsevier B.V. All rights reserved.
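As an illustration of the general idea of oversampling by replicating minority-class instances, the following is a minimal sketch of plain random oversampling. It is not SPIDER2, SPIDER3, or any of the five methods evaluated in the study, whose details are not given in the abstract; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Append exact copies until the class reaches the target size.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 8 non-defective modules (label 0)
# and 2 defective modules (label 1), each with two metric values.
rows = [[i, i * 2] for i in range(10)]
labels = [0] * 8 + [1] * 2

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

After the call, both classes have the same number of rows, and every added row is an exact copy of an original minority-class row. Resampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.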
Pages: 120-140
Page count: 21