The peaking phenomenon in the presence of feature-selection

被引:60
作者
Sima, Chao [1 ]
Dougherty, Edward R. [1 ,2 ]
机构
[1] Translat Genom Res Inst, Computat Biol Div, Phoenix, AZ 85004 USA
[2] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX 77843 USA
基金
美国国家科学基金会;
关键词
classification; feature-selection; peaking phenomenon;
D O I
10.1016/j.patrec.2008.04.010
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
For a fixed sample size, a common phenomenon is that the error of a designed classifier decreases and then increases as the number of features grows. This peaking phenomenon has been recognized for forty years and depends on the classification rule and feature-label distribution. Historically, the peaking phenomenon has been treated by assuming a fixed Ordering of the features, usually beginning with the strongest individual feature and proceeding with features of decreasing individual classification capability. This does not take into account feature-selection, which is commonplace in high-dimensional and small sample settings. This paper revisits the peaking phenomenon in the presence of feature-selection. Using massive simulation in a high-performance computing environment, the paper considers various combinations of feature-label models, feature-selection algorithms, and classifier models to produce a large library of error versus feature size curves. Owing to the prevalence of feature-selection in genomic classification, we also consider gene-expression-based classification of breast-cancer patient prognosis. Results vary widely and are strongly dependent on the combination. The error curves tend to fall into three categories: peaking, settling into a plateau, or falling very slowly over a long range of feature set sizes. It can be concluded that one should be wary of applying peaking results found in the absence of feature-selection to settings in which feature-selection is employed. (c) 2008 Elsevier B.V. All rights reserved.
引用
收藏
页码:1667 / 1674
页数:8
相关论文
共 23 条