On the Multiple Sources and Privacy Preservation Issues for Heterogeneous Defect Prediction

Cited by: 87
Authors
Li, Zhiqiang [1]
Jing, Xiao-Yuan [1, 2]
Zhu, Xiaoke [1, 3]
Zhang, Hongyu [4]
Xu, Baowen [1]
Ying, Shi [1]
Affiliations
[1] Wuhan Univ, State Key Lab Software Engn, Sch Comp, Wuhan 430072, Hubei, Peoples R China
[2] Nanjing Univ Posts & Telecommun, Coll Automat, Nanjing 210023, Jiangsu, Peoples R China
[3] Henan Univ, Sch Comp & Informat Engn, Kaifeng 475001, Peoples R China
[4] Univ Newcastle, Sch Elect Engn & Comp, Callaghan, NSW 2308, Australia
Keywords
Heterogeneous defect prediction; multiple sources; privacy preservation; utility; source selection; manifold discriminant alignment; STATIC CODE ATTRIBUTES; RESEARCHER BIAS; MACHINE; CLASSIFICATION; METRICS; MODELS; FAULTS;
DOI
10.1109/TSE.2017.2780222
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Heterogeneous defect prediction (HDP) refers to predicting the defect-proneness of software modules in a target project using heterogeneous metric data from other projects. Existing HDP methods mainly focus on predicting target instances with a single source. In practice, there exist plenty of external projects, and multiple sources can generally provide more information than a single project. Therefore, it is meaningful to investigate whether HDP performance can be improved by employing multiple sources. However, a precondition of conducting HDP is that the external sources are available. Due to privacy concerns, most companies are not willing to share their data. To facilitate data sharing, it is essential to study how to protect the privacy of data owners before they release their data. In this paper, we study the above two issues in HDP. Specifically, to utilize multiple sources effectively, we propose a multi-source selection based manifold discriminant alignment (MSMDA) approach. To protect the privacy of data owners, a sparse representation based double obfuscation algorithm is designed and applied to HDP. Through a case study of 28 projects, our results show that MSMDA can achieve better performance than a range of baseline methods. The improvement is 3.4-15.3 percent in g-measure and 3.0-19.1 percent in AUC.
Pages: 391-411
Page count: 21