Optimality Driven Nearest Centroid Classification from Genomic Data
Cited by: 26
Authors:
Dabney, Alan R. [1]
Storey, John D. [2,3]
Affiliations:
[1] Texas A&M Univ, Dept Stat, College Stn, TX 77843 USA
[2] Univ Washington, Dept Biostat, Seattle, WA 98195 USA
[3] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
Source:
PLOS ONE | 2007 / Volume 2 / Issue 10
DOI:
10.1371/journal.pone.0001002
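Chinese Library Classification:
O [Mathematical sciences and chemistry];
P [Astronomy and earth sciences];
Q [Biological sciences];
N [General natural sciences];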
Discipline classification codes:
07; 0710; 09
Abstract:
Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest centroid classifiers.
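
The contrast the abstract draws, between scoring features one at a time and choosing a subset by how it performs as a whole, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a shared diagonal covariance and equal class priors, and it uses the smallest pairwise standardized distance between class centroids as a stand-in criterion for subset quality. The function names (`fit_centroids`, `greedy_select`, `predict`) and the synthetic data are hypothetical, and plain sample means are used rather than the shrinkage estimates the abstract mentions.

```python
import numpy as np

def fit_centroids(X, y):
    """Return class labels, class centroids, and pooled per-feature variances."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    # Pooled within-class variance per feature (diagonal covariance assumption).
    resid = X - centroids[np.searchsorted(classes, y)]
    pooled_var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
    return classes, centroids, pooled_var

def greedy_select(centroids, pooled_var, n_features):
    """Greedy forward selection: at each step add the feature that maximizes
    the smallest pairwise standardized squared distance between centroids.
    This criterion is an illustrative assumption, not taken from the paper."""
    k, p = centroids.shape
    pairs = [(a, b) for a in range(k) for b in range(a + 1, k)]
    # contrib[i, j]: what feature j adds to the distance for class pair i.
    contrib = np.array(
        [(centroids[a] - centroids[b]) ** 2 / pooled_var for a, b in pairs]
    )
    dist = np.zeros(len(pairs))       # distances over the features chosen so far
    used = np.zeros(p, dtype=bool)
    selected = []
    for _ in range(n_features):
        worst_pair = (dist[:, None] + contrib).min(axis=0)
        worst_pair[used] = -np.inf    # never pick the same feature twice
        j = int(np.argmax(worst_pair))
        selected.append(j)
        used[j] = True
        dist += contrib[:, j]
    return selected

def predict(X, classes, centroids, pooled_var, selected):
    """Assign each row of X to the nearest centroid on the selected features."""
    Xs, Cs = X[:, selected], centroids[:, selected]
    d = ((Xs[:, None, :] - Cs[None, :, :]) ** 2 / pooled_var[selected]).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

# Synthetic demo: 3 classes, 50 features, only features 0-2 carry signal.
rng = np.random.default_rng(0)
offsets = np.array([[0, 0, 0], [3, -3, 3], [6, 3, -3]], dtype=float)

def simulate(n_per_class):
    y = np.repeat(np.arange(3), n_per_class)
    X = rng.normal(size=(len(y), 50))
    X[:, :3] += offsets[y]
    return X, y

X_train, y_train = simulate(30)
X_test, y_test = simulate(30)
classes, centroids, pooled_var = fit_centroids(X_train, y_train)
selected = greedy_select(centroids, pooled_var, n_features=3)
accuracy = float(
    (predict(X_test, classes, centroids, pooled_var, selected) == y_test).mean()
)
print("selected features:", sorted(selected), "test accuracy:", accuracy)
```

Because each candidate is evaluated by what it adds to the features already chosen, a feature that separates only an already well-separated pair of classes gains little under this criterion, whereas a univariate ranking could still place it near the top.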