A Survey on Cross-Project Software Defect Prediction Methods

Cited: 0
Authors
Chen X. [1 ,2 ]
Wang L.-P. [1 ]
Gu Q. [2 ]
Wang Z. [3 ]
Ni C. [2 ]
Liu W.-S. [2 ]
Wang Q.-P. [1 ]
Affiliations
[1] School of Computer Science and Technology, Nantong University, Nantong, 226019, Jiangsu
[2] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing
[3] School of Computer Software, Tianjin University, Tianjin
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2018 / Vol. 41 / No. 01
Funding
National Natural Science Foundation of China
Keywords
Cross-project defect prediction; Empirical software engineering; Empirical studies; Software defect prediction; Transfer learning;
DOI
10.11897/SP.J.1016.2018.00254
Abstract
Software defect prediction first analyzes and mines software historical repositories to extract program modules and label them. It then designs metrics that correlate strongly with defects, based on analysis of code complexity or the development process, and uses these metrics to measure the extracted program modules. Finally, it applies a machine learning algorithm to the resulting datasets to construct defect prediction models. Software defect prediction can therefore optimize the allocation of software testing resources by identifying potentially defective modules in advance. However, in real software development, a project that needs defect prediction may be new or may have little training data. A simple solution is to construct the model directly with training data from other projects. However, the application domain, development process, programming language, and developer experience of different projects may differ, so the distributions of the corresponding datasets can diverge substantially, resulting in poor prediction performance. Therefore, how to effectively transfer knowledge from a source project to build a defect prediction model for a target project has attracted researchers' attention; this problem is called cross-project defect prediction (CPDP). We conduct a comprehensive survey on this topic and classify existing methods into three categories: supervised learning based methods, unsupervised learning based methods, and semi-supervised learning based methods. Supervised learning based methods use the modules of the source project to construct the model. They can be further divided into homogeneous and heterogeneous cross-project defect prediction, depending on whether the source project and the target project use the same metric set. For the former, researchers have designed methods based on metric value transformation, instance selection and weighting, feature mapping and selection, ensemble learning, and class imbalance learning. The latter is more challenging, and researchers have designed methods based on feature mapping and canonical correlation analysis. Unsupervised learning based methods attempt to predict directly on the modules of the target project, under the assumption that defective modules tend to have higher metric values than non-defective modules; researchers have designed such methods using clustering algorithms. Semi-supervised learning based methods construct the model from the modules of the source project together with a small number of labeled modules in the target project; they try to improve CPDP performance by identifying representative program modules in the target project and labeling them manually, and researchers have designed such methods using ensemble learning and TrAdaBoost. We summarize and comment on the existing research work for each category in turn. We then analyze the performance measures and benchmark datasets commonly used in CPDP empirical studies, to help other researchers design better experiments. Finally, we conclude the paper and discuss potential future research directions along four dimensions: dataset gathering, dataset preprocessing, CPDP model construction and evaluation, and CPDP model application. © 2018, Science Press. All rights reserved.
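The abstract outlines the basic defect prediction workflow (extract and label program modules, measure them with metrics, then train a classifier) and its cross-project variant. The sketch below illustrates the simplest homogeneous CPDP setting, where a model trained on a labeled source project is applied to a target project sharing the same metric set. It is only a minimal illustration assuming scikit-learn and hypothetical project-loading helpers; it is not the survey's own method, and the per-project standardization is merely a simple stand-in for the metric value transformation techniques the survey discusses.

```python
# Minimal sketch of homogeneous cross-project defect prediction:
# train on a labeled source project, predict on an unlabeled target project.
# Assumes both projects are described by the same metric set (one row per
# program module, one column per metric). Project names and loaders below
# are hypothetical.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def cross_project_predict(source_X, source_y, target_X):
    """Fit a classifier on source-project modules and score target modules."""
    # Standardize metrics within each project to reduce distribution
    # differences between source and target (a simple stand-in for
    # metric value transformation).
    src = StandardScaler().fit_transform(source_X)
    tgt = StandardScaler().fit_transform(target_X)
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(src, source_y)
    # Return a defect-proneness score for each target module.
    return clf.predict_proba(tgt)[:, 1]

# Usage (hypothetical loaders):
# source_X, source_y = load_labeled_project("ant")   # labeled source project
# target_X = load_project_metrics("camel")           # unlabeled target project
# scores = cross_project_predict(source_X, source_y, target_X)
```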
Pages: 254-274
Number of pages: 20