A comparative study of iterative and non-iterative feature selection techniques for software defect prediction

Cited by: 57
Authors
Khoshgoftaar, Taghi M. [1 ]
Gao, Kehan [2 ]
Napolitano, Amri [1 ]
Wald, Randall [1 ]
Affiliations
[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, Empir Software Engn Lab, Boca Raton, FL 33431 USA
[2] Eastern Connecticut State Univ, Willimantic, CT 06226 USA
Keywords
Iterative feature selection; Software defect prediction; Data sampling; High dimensionality; Class imbalance;
DOI
10.1007/s10796-013-9430-0
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Two important problems that can affect the performance of classification models are high dimensionality (an overabundance of independent features in the dataset) and imbalanced data (a skewed class distribution that leaves at least one class with far fewer instances than the other classes). To resolve these problems concurrently, we propose an iterative feature selection approach, which repeatedly applies data sampling (to address class imbalance) followed by feature selection (to address high dimensionality), and then performs an aggregation step that combines the ranked feature lists from the separate iterations of sampling. This approach is designed to find a ranked feature list that is particularly effective on the more balanced dataset resulting from sampling, while minimizing the risk of losing data through the sampling step and missing important features. To demonstrate this technique, we employ 18 different feature selection algorithms and Random Undersampling with two post-sampling class distributions. We also investigate the use of sampling and feature selection without the iterative step (i.e., using the ranked list from a single iteration rather than combining the lists from multiple iterations), and compare these results with those of the version that uses iteration. Our study is carried out using three groups of datasets with different levels of class balance, all of which were collected from a real-world software system. All of our experiments use four different learners and one feature subset size. We find that our proposed iterative feature selection approach outperforms the non-iterative approach.
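The sample-rank-aggregate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the 18 rankers, the aggregation rule, the number of iterations, or the two post-sampling class distributions, so the scoring function (`score_features`, a simple mean-difference ratio), mean-rank aggregation, and the default values of `n_iterations` and `minority_fraction` are all hypothetical placeholders. The class labels are assumed to be 1 for the minority (defect-prone) class and 0 for the majority class.

```python
import numpy as np

def random_undersample(X, y, minority_fraction, rng):
    """Randomly drop majority-class rows (label 0) until the minority class
    (label 1) makes up roughly `minority_fraction` of the sample."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_major = int(round(len(minority) * (1 - minority_fraction) / minority_fraction))
    n_major = min(n_major, len(majority))
    kept = rng.choice(majority, size=n_major, replace=False)
    idx = np.concatenate([minority, kept])
    return X[idx], y[idx]

def score_features(X, y):
    """Placeholder ranker (an assumption, not one of the paper's 18):
    absolute difference of class means divided by the summed class spreads."""
    pos, neg = X[y == 1], X[y == 0]
    spread = pos.std(axis=0) + neg.std(axis=0) + 1e-12
    return np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / spread

def iterative_feature_ranking(X, y, n_iterations=10, minority_fraction=0.5, seed=0):
    """Repeat sampling followed by ranking, then aggregate the ranked lists.
    Aggregation here is by mean rank, which is an assumed rule."""
    rng = np.random.default_rng(seed)
    rank_sum = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        Xs, ys = random_undersample(X, y, minority_fraction, rng)
        scores = score_features(Xs, ys)
        # Convert scores to ranks: rank 0 is the highest-scoring feature.
        ranks = np.argsort(np.argsort(-scores, kind="stable"), kind="stable")
        rank_sum += ranks
    # Feature indices ordered from best (lowest mean rank) to worst.
    return np.argsort(rank_sum / n_iterations, kind="stable")
```

Calling `iterative_feature_ranking` with `n_iterations=1` corresponds to the non-iterative variant, where the ranked list from a single round of sampling is used directly.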
Pages: 801-822
Page count: 22