Heterogeneous defect prediction with two-stage ensemble learning

Cited by: 49
Authors
Li, Zhiqiang [1, 2]
Jing, Xiao-Yuan [2, 3]
Zhu, Xiaoke [4 ]
Zhang, Hongyu [5 ]
Xu, Baowen [6 ]
Ying, Shi [2 ]
Affiliations
[1] Shaanxi Normal Univ, Sch Comp Sci, Xian 710119, Shaanxi, Peoples R China
[2] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Hubei, Peoples R China
[3] Nanjing Univ Posts & Telecommun, Sch Automat, Nanjing 210023, Jiangsu, Peoples R China
[4] Henan Univ, Sch Comp & Informat Engn, Kaifeng 475001, Peoples R China
[5] Univ Newcastle, Sch Elect Engn & Comp, Callaghan, NSW 2308, Australia
[6] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210093, Jiangsu, Peoples R China
Keywords
Heterogeneous defect prediction; Two-stage ensemble learning; Linear inseparability; Multiple kernel learning; Class imbalance; Data sampling; Domain adaptation; STATIC CODE ATTRIBUTES; CLASSIFICATION; MACHINE; MODELS;
DOI
10.1007/s10515-019-00259-1
Chinese Library Classification number
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
Heterogeneous defect prediction (HDP) refers to predicting defect-prone software modules in one project (target) using heterogeneous data collected from other projects (source). Recently, several HDP methods have been proposed. However, these methods do not sufficiently account for two characteristics of defect data: (1) the data may be linearly inseparable, and (2) the data may be highly imbalanced. These two characteristics make it challenging to build an effective HDP model. In this paper, we propose a novel Two-Stage Ensemble Learning (TSEL) approach to HDP, which contains two stages: an ensemble multi-kernel domain adaptation (EMDA) stage and an ensemble data sampling (EDS) stage. In the EMDA stage, we develop an Ensemble Multiple Kernel Correlation Alignment (EMKCA) predictor, which combines the advantages of multiple kernel learning and domain adaptation techniques. In the EDS stage, we employ the RESample with replacement (RES) technique to learn multiple different EMKCA predictors and use an average ensemble to combine them. Together, these two stages create an ensemble of defect predictors. Extensive experiments on 30 public projects show that the proposed TSEL approach outperforms a range of competing methods. The improvement is 20.14-33.92% in AUC, 36.05-54.78% in F-measure, and 5.48-19.93% in balance.
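
The EDS stage described in the abstract (resample with replacement, train one predictor per sample, average the outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the base learner here is a simple nearest-centroid scorer standing in for the EMKCA predictor, the resampling is assumed to draw an equal number of defective and clean modules, and all function names (`resample_balanced`, `fit_centroid_scorer`, `fit_average_ensemble`) are hypothetical.

```python
import random
from statistics import mean


def resample_balanced(X, y, rng):
    """Draw a class-balanced sample with replacement.

    Assumption: each class is sampled up to the size of the larger class.
    The paper's RES procedure may differ in detail.
    """
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n = max(len(pos), len(neg))
    idx = [rng.choice(pos) for _ in range(n)] + [rng.choice(neg) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_centroid_scorer(X, y):
    """Stand-in base predictor (NOT EMKCA): scores a module by its
    relative distance to the defective and clean class centroids."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [mean(col) for col in zip(*rows)]

    c_pos, c_neg = centroid(1), centroid(0)

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    def score(x):
        d_pos, d_neg = dist(x, c_pos), dist(x, c_neg)
        total = d_pos + d_neg
        # Closer to the defective centroid -> score nearer to 1.
        return 0.5 if total == 0 else d_neg / total

    return score


def fit_average_ensemble(X, y, n_models=10, seed=0):
    """Train one scorer per resampled set and average their scores."""
    rng = random.Random(seed)
    scorers = [
        fit_centroid_scorer(*resample_balanced(X, y, rng))
        for _ in range(n_models)
    ]
    return lambda x: mean(s(x) for s in scorers)
```

Calling `fit_average_ensemble(X, y)` returns a function that maps a feature vector to a defect-proneness score in [0, 1]. Averaging over several resampled training sets is what reduces the influence of any single skewed sample.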
Pages: 599-651
Page count: 53