Feature Selection Techniques to Counter Class Imbalance Problem for Aging Related Bug Prediction

Cited by: 13
Authors
Kumar, Lov [1]
Sureka, Ashish [2]
Affiliations
[1] Thapar Univ, Patiala, Punjab, India
[2] Ashoka Univ, Sonepat, Haryana, India
Source
ISEC'18: PROCEEDINGS OF THE 11TH INNOVATIONS IN SOFTWARE ENGINEERING CONFERENCE | 2018
Keywords
Aging Related Bugs; Imbalance Learning; Empirical Software Engineering; Feature Selection Techniques; Machine Learning; Predictive Modeling; Software Maintenance; Source Code Metrics; CLASSIFICATION; COMPLEXITY;
DOI
10.1145/3172871.3172872
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Aging-Related Bugs (ARBs) occur in long-running systems due to error conditions caused by the accumulation of problems such as memory leaks or unreleased files and locks. Aging-Related Bugs are hard to discover during software testing and are also challenging to replicate. Automatic identification and prediction of aging-related fault-prone files and classes in an object-oriented system can help the software quality assurance team optimize its testing efforts. In this paper, we present a study on the application of static source code metrics and machine learning techniques to predict aging-related bugs. We conduct a series of experiments on publicly available datasets from two large open-source software systems: Linux and MySQL. Class imbalance and high dimensionality are the two main technical challenges in building effective predictors for aging-related bugs. We investigate the application of five different feature selection techniques (OneR, Information Gain, Gain Ratio, ReliefF and Symmetric Uncertainty) for dimensionality reduction and five different strategies (Random Under-sampling, Random Over-sampling, SMOTE, SMOTEBoost and RUSBoost) to counter the effect of class imbalance in our proposed machine-learning-based solution approach. Experimental results reveal that the random under-sampling approach performs best, followed by RUSBoost, in terms of the mean AUC metric. A statistical significance test demonstrates that there is a significant difference between the performance of the various feature selection techniques. Experimental results show that Gain Ratio and ReliefF perform best in comparison to the other techniques when addressing the class imbalance problem. We infer from the statistical significance test that there is no significant difference between the performances of the five different learning algorithms.
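As a rough illustration of two ideas named in the abstract, the sketch below ranks a feature by Information Gain, Gain Ratio and Symmetric Uncertainty, and balances a dataset by random under-sampling. It is not the authors' implementation: the function names are hypothetical, the features are assumed to be already discretized (real source code metrics are continuous and would need binning first), and under-sampling here simply cuts every class down to the size of the smallest one.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def information_gain(feature, labels):
    """H(labels) minus the entropy of labels conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    if split_info <= 0:
        return 0.0  # a constant feature carries no information
    return information_gain(feature, labels) / split_info


def symmetric_uncertainty(feature, labels):
    """2 * IG / (H(feature) + H(labels)), which lies in [0, 1]."""
    denom = entropy(feature) + entropy(labels)
    if denom <= 0:
        return 0.0
    return 2 * information_gain(feature, labels) / denom


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    keep = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, keep):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


if __name__ == "__main__":
    # Toy data (made up): 8 clean files (label 0) and 2 ARB-prone files (label 1).
    labels = [0] * 8 + [1] * 2
    complexity_bin = ["low"] * 7 + ["high"] * 3  # hypothetical binned metric
    rows = [[b] for b in complexity_bin]
    print("gain ratio:", gain_ratio(complexity_bin, labels))
    balanced_rows, balanced_labels = random_undersample(rows, labels)
    print("class counts after under-sampling:", Counter(balanced_labels))
```

Scores like these would typically be computed per metric, and only the top-ranked metrics kept before training a classifier on the re-balanced data.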
Pages: 11
Related papers
27 in total
  • [1] [Anonymous], IEEE T SYSTEMS MAN A
  • [2] [Anonymous], 1999, Ph.D. Thesis
  • [3] [Anonymous], 2015, PROMISE REPOSITORY E
  • [4] Analysis and Prediction of Mandelbugs in an Industrial Software System
    Carrozza, Gabriella
    Cotroneo, Domenico
    Natella, Roberto
    Pietrantuono, Roberto
    Russo, Stefano
    [J]. 2013 IEEE SIXTH INTERNATIONAL CONFERENCE ON SOFTWARE TESTING, VERIFICATION AND VALIDATION (ICST 2013), 2013, : 262 - 271
  • [5] Caruana R., 2006, P 23 INT C MACH LEAR, P161, DOI 10.1145/1143844.1143865
  • [6] Chawla NV, 2005, DATA MINING AND KNOWLEDGE DISCOVERY HANDBOOK, P853, DOI 10.1007/0-387-25465-X_40
  • [7] SMOTE: Synthetic minority over-sampling technique
    Chawla, Nitesh V.
    Bowyer, Kevin W.
    Hall, Lawrence O.
    Kegelmeyer, W. Philip
    [J]. 2002, American Association for Artificial Intelligence (16)
  • [8] SMOTEBoost: Improving prediction of the minority class in boosting
    Chawla, NV
    Lazarevic, A
    Hall, LO
    Bowyer, KW
    [J]. KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2003, PROCEEDINGS, 2003, 2838 : 107 - 119
  • [9] Cotroneo Domenico, 2010, Proceedings of the 2010 IEEE 21st International Symposium on Software Reliability Engineering (ISSRE 2010), P71, DOI 10.1109/ISSRE.2010.24
  • [10] Cotroneo D., 2010 IEEE 2 INT WORK, V2010, P1