Impact of Data Sampling on Stability of Feature Selection for Software Measurement Data

Cited by: 5

Authors:

Gao, Kehan [1]

Khoshgoftaar, Taghi M. [2]

Napolitano, Amri [2]

Affiliations:

[1] Eastern Connecticut State Univ, Willimantic, CT 06226 USA

[2] Florida Atlantic Univ, Boca Raton, FL 33431 USA

Source:

2011 23RD IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2011) | 2011

Keywords:

feature selection; data sampling; software metrics; defect prediction; stability; CLASSIFICATION; ALGORITHMS; MODELS;

DOI:

10.1109/ICTAI.2011.172

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Software defect prediction can be considered a binary classification problem. Generally, practitioners utilize historical software data, including metric and fault data collected during the software development process, to build a classification model and then employ this model to predict new program modules as either fault-prone (fp) or not-fault-prone (nfp). Limited project resources can then be allocated according to the prediction results by (for example) assigning more reviews and testing to the modules predicted to be potentially defective. Two challenges often come with the modeling process: (1) high dimensionality of software measurement data and (2) skewed or imbalanced distributions between the two types of modules (fp and nfp) in those datasets. To overcome these problems, extensive studies have been dedicated to improving the quality of training data. The commonly used techniques are feature selection and data sampling. Usually, researchers focus on evaluating classification performance after the training data is modified. The present study assesses a feature selection technique from a different perspective. We are more interested in studying the stability of a feature selection method, especially in understanding the impact of data sampling techniques on the stability of feature selection when using the sampled data. Two case studies performed on datasets from two real-world software projects yield several interesting findings.


Pages: 1004-1011

Page count: 8

