Stability of Filter- and Wrapper-based Software Metric Selection Techniques

Cited by: 0
Authors
Wang, Huanjing [1 ]
Khoshgoftaar, Taghi M. [2 ]
Napolitano, Amri [2 ]
Affiliations
[1] Western Kentucky Univ, Bowling Green, KY 42101 USA
[2] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2014 IEEE 15TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI) | 2014
Keywords
feature subset selection; software measurements; filters; wrappers; stability; algorithms
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
For most software systems, some of the software metrics collected during the software development cycle may contain redundant information, provide no information, or may have an adverse effect on prediction models built with these metrics. An intelligent selection of software metrics (features) using feature selection techniques (which reduce the feature subset to an optimal size) prior to building defect prediction models may improve the final defect prediction results. While some feature selection techniques consider each feature individually, feature subset selection evaluates entire feature subsets and thus can help remove redundant features. Unfortunately, feature subset selection may have the problem of selecting different features from similar datasets. This paper addresses the question of which feature subset selection methods are stable in the face of changes to the data (here, the addition or removal of instances). We examine twenty-seven feature subset selection methods, including two filter-based techniques and twenty-five wrapper-based techniques (five choices of wrapper learner combined with five choices of wrapper performance metric). We used the Average Tanimoto Index (ATI) as our stability metric, because it is able to compare two feature subsets of different size. All experiments were conducted on three software metric datasets from a real-world software project. Our results show that the Correlation-Based Feature Selection (CFS) approach has the greatest stability overall. All wrapper-based techniques are less stable than CFS. Among the twenty-five wrappers, wrappers built with the Naive Bayes learner and either the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) or the Area Under the Precision-Recall Curve (PRC) performance metric are, in general, the most stable wrapper-based approaches.
Pages: 309-314
Page count: 6