Cross-Project and Within-Project Semisupervised Software Defect Prediction: A Unified Approach

Cited by: 111
Authors
Wu, Fei [1]
Jing, Xiao-Yuan [1, 2]
Sun, Ying [1]
Sun, Jing [1]
Huang, Lin [1]
Cui, Fangyi [1]
Sun, Yanfei [1]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Automat, Nanjing 210003, Jiangsu, Peoples R China
[2] Wuhan Univ, Sch Comp, State Key Lab Software Engn, Wuhan 430072, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cost-sensitive kernelized semisupervised dictionary learning (CKSDL); cross-project semisupervised defect prediction (CSDP); within-project semisupervised defect prediction (WSDP); NETWORKS; MACHINE;
DOI
10.1109/TR.2018.2804922
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
When there are not enough historical defect data to build an accurate prediction model, semisupervised defect prediction (SSDP) and cross-project defect prediction (CPDP) are two feasible solutions. Existing CPDP methods assume that the available source data are well labeled. However, because labeling a large amount of defect data requires expensive human effort, we can usually only use suitable unlabeled source data. We call CPDP in this scenario cross-project semisupervised defect prediction (CSDP). Although some within-project semisupervised defect prediction (WSDP) methods have been developed in recent years, there is still much room for improvement in prediction performance. In this paper, we aim to provide a unified and effective solution for both the CSDP and WSDP problems. We introduce the semisupervised dictionary learning technique and propose a cost-sensitive kernelized semisupervised dictionary learning (CKSDL) approach. CKSDL can make full use of the limited labeled defect data and a large amount of unlabeled data in the kernel space. In addition, CKSDL considers the misclassification costs in the dictionary learning process. Extensive experiments on 16 projects indicate that CKSDL outperforms state-of-the-art WSDP methods, that using unlabeled cross-project defect data can help improve WSDP performance, and that CKSDL generally obtains significantly better prediction performance than related SSDP methods in the CSDP scenario.
Pages: 581-597
Page count: 17