Empirical Evaluation of the Impact of Class Overlap on Software Defect Prediction

Cited by: 32
Authors
Gong, Lina [1 ,2 ,3 ]
Jiang, Shujuan [1 ,2 ]
Wang, Rongcun [1 ,2 ]
Jiang, Li [1 ,2 ]
Affiliations
[1] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Jiangsu, Peoples R China
[2] Minist Educ, Mine Digitizat Engn Res Ctr, Xuzhou 221116, Jiangsu, Peoples R China
[3] Zaozhuang Univ, Dept Informat Sci & Engn, Zaozhuang 277160, Peoples R China
Source
34TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE 2019) | 2019
Funding
National Natural Science Foundation of China;
Keywords
Class overlap; Software defect prediction; K Means clustering; Machine learning; MACHINE;
DOI
10.1109/ASE.2019.00071
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Software defect prediction (SDP) uses learning models to detect defective modules in a project, and the performance of these models depends on the quality of the training data. Previous research has mainly focused on two data quality problems, class imbalance and feature redundancy. However, training data often contains instances that belong to different classes but have similar feature values, and the resulting class overlap degrades the quality of the training data. Our goal is to investigate the impact of class overlap on software defect prediction. We also propose an improved K-Means clustering cleaning approach (IKMCCA) that addresses both the class overlap and class imbalance problems. Specifically, we check whether the K-Means clustering cleaning approach (KMCCA), neighborhood cleaning learning (NCL), or IKMCCA can improve defect detection performance in two cases: (i) within-project defect prediction (WPDP) and (ii) cross-project defect prediction (CPDP). To obtain an objective estimate of the effect of class overlap, we carry out our investigation on 28 open source projects and compare the performance of state-of-the-art learning models in both cases when trained with IKMCCA, KMCCA, or NCL versus without data cleaning. The experimental results show that learning models obtain significantly better performance in terms of balance, Recall, and AUC for both WPDP and CPDP when the overlapping instances are removed. Moreover, it is better to address class overlap and class imbalance together.
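The abstract reports results in terms of balance and Recall without defining them. As a rough illustration only, the sketch below assumes the definition of balance commonly used in the defect prediction literature: one minus the normalized Euclidean distance from the point (pf, pd) to the ideal point (0, 1), where pd is Recall (probability of detection) and pf is the false alarm rate. The function name `recall_pf_balance` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def recall_pf_balance(tp, fn, fp, tn):
    """Return (recall, pf, balance) from confusion-matrix counts.

    Assumed definitions (not confirmed by the abstract):
      recall (pd) = tp / (tp + fn)
      pf          = fp / (fp + tn)
      balance     = 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2)
    """
    # Guard against empty classes so the sketch never divides by zero.
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    # Distance from (pf, pd) to the ideal point (0, 1), scaled to [0, 1].
    distance = math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
    return pd, pf, 1.0 - distance


if __name__ == "__main__":
    # Hypothetical counts: 8 defective modules found, 2 missed,
    # 10 clean modules flagged by mistake, 80 correctly left alone.
    recall, pf, balance = recall_pf_balance(tp=8, fn=2, fp=10, tn=80)
    print(f"recall={recall:.3f} pf={pf:.3f} balance={balance:.3f}")
```

Under this definition, balance rewards models that keep Recall high and false alarms low at the same time, which is why it is often reported alongside Recall and AUC on imbalanced defect data.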
Pages: 710-721
Number of pages: 12