An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data

Cited by: 76
Authors
Malhotra, Ruchika [1 ]
Kamal, Shine [1 ]
Affiliations
[1] Delhi Technol Univ, Dept Comp Sci & Engn, Discipline Software Engn, Delhi, India
Keywords
Defect prediction; Imbalanced data; Oversampling methods; MetaCost learners; Machine learning techniques; Procedural metrics; SAMPLING APPROACH; NEURAL-NETWORKS; CLASSIFICATION; SMOTE; QUALITY;
DOI
10.1016/j.neucom.2018.04.090
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Software defect prediction is important for identifying defects in the early phases of the software development life cycle. This early identification, and thereby removal, of software defects is crucial to yielding a cost-effective, good-quality software product. Although previous studies have successfully used machine learning techniques for software defect prediction, these techniques yield biased results when applied to imbalanced data sets. An imbalanced data set has a non-uniform class distribution, with very few instances of one class compared to the other class. Use of imbalanced data sets leads to off-target predictions of the minority class, which is generally considered to be more important than the majority class. Thus, handling imbalanced data effectively is crucial for the successful development of a competent defect prediction model. This study evaluates the effectiveness of machine learning classifiers for software defect prediction on twelve imbalanced NASA data sets by applying sampling methods and cost-sensitive classifiers. We investigate five existing oversampling methods, which replicate the instances of the minority class, and also propose a new method, SPIDER3, by suggesting modifications to the SPIDER2 oversampling method. Furthermore, the work evaluates the performance of MetaCost learners for cost-sensitive learning on imbalanced data sets. The results show improvement in the prediction capability of machine learning classifiers with the use of oversampling methods. Furthermore, the proposed SPIDER3 method shows promising results. (C) 2019 Elsevier B.V. All rights reserved.
Pages: 120-140
Page count: 21
Related papers
50 in total
  • [41] Improving Performance in Software Defect Prediction Using Variational Autoencoder
    Eivazpour, Z.
    Keyvanpour, Mohammad Reza
    2019 IEEE 5TH CONFERENCE ON KNOWLEDGE BASED ENGINEERING AND INNOVATION (KBEI 2019), 2019, : 644 - 649
  • [42] Oversampling Highly Imbalanced Indoor Positioning Data using Deep Generative Models
    Alhomayani, Fahad
    Mahoor, Mohammad H.
    2021 IEEE SENSORS, 2021,
  • [43] Influence-Balanced XGBoost: Improving XGBoost for Imbalanced Data Using Influence Functions
    Sutou, Akiyoshi
    Wang, Jinfang
    IEEE ACCESS, 2024, 12 : 193473 - 193486
  • [44] An Ensemble Oversampling Model for Class Imbalance Problem in Software Defect Prediction
    Huda, Shamsul
    Liu, Kevin
    Abdelrazek, Mohamed
    Ibrahim, Amani
    Alyahya, Sultan
    Al-Dossari, Hmood
    Ahmad, Shafiq
    IEEE ACCESS, 2018, 6 : 24184 - 24195
  • [45] Evaluation of Risk Factors for Fall in Elderly People from Imbalanced Data using the Oversampling Technique SMOTE
    Sihag, Gulshan
    Yadav, Pankaj
    Delcroix, Veronique
    Vijay, Vivek
    Siebert, Xavier
    Yadav, Sandeep Kumar
    Puisieux, Francois
    ICT4AWE: PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGIES FOR AGEING WELL AND E-HEALTH, 2022, : 50 - 58
  • [46] Imbalanced Data Mining Using Oversampling and Cellular GEP Ensemble
    Jedrzejowicz, Joanna
    Jedrzejowicz, Piotr
    COMPUTATIONAL COLLECTIVE INTELLIGENCE (ICCCI 2021), 2021, 12876 : 360 - 372
  • [47] A novel preprocessing approach for imbalanced learning in software defect prediction
    Bashir, Kamal
    Li, Tianrui
    Yohannese, Chubato Wondaferaw
    Yahaya, Mahama
    Ali, Tayseer
    DATA SCIENCE AND KNOWLEDGE ENGINEERING FOR SENSING DECISION SUPPORT, 2018, 11 : 500 - 508
  • [48] Impact of Feature Selection Methods on the Predictive Performance of Software Defect Prediction Models: An Extensive Empirical Study
    Balogun, Abdullateef O.
    Basri, Shuib
    Mahamad, Saipunidzam
    Abdulkadir, Said J.
    Almomani, Malek A.
    Adeyemo, Victor E.
    Al-Tashi, Qasem
    Mojeed, Hammed A.
    Imam, Abdullahi A.
    Bajeh, Amos O.
    SYMMETRY-BASEL, 2020, 12 (07):
  • [49] A Comprehensive Investigation of the Role of Imbalanced Learning for Software Defect Prediction
    Song, Qinbao
    Guo, Yuchen
    Shepperd, Martin
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2019, 45 (12) : 1253 - 1269
  • [50] Predicting defects in imbalanced data using resampling methods: an empirical investigation
    Malhotra, Ruchika
    Jain, Juhi
    PEERJ COMPUTER SCIENCE, 2022, 8