FEATURE SELECTION IN OMICS PREDICTION PROBLEMS USING CAT SCORES AND FALSE NONDISCOVERY RATE CONTROL

Cited by: 92

Authors:

Ahdesmaeki, Miika [1,2]

Strimmer, Korbinian [1]

Affiliations:

[1] Univ Leipzig, IMISE, D-04107 Leipzig, Germany

[2] Tampere Univ Technol, Dept Signal Proc, FI-33101 Tampere, Finland

Source:

ANNALS OF APPLIED STATISTICS | 2010 / Volume 4 / Issue 01

Keywords:

Feature selection; linear discriminant analysis; correlation; James-Stein estimator; "small n, large p" setting; correlation-adjusted t-score; false discovery rates; higher criticism; LINEAR DISCRIMINANT-ANALYSIS; SHRUNKEN CENTROIDS; CLASSIFICATION; REGRESSION; DISCOVERY; RANKING; BAYES;

DOI:

10.1214/09-AOAS277

Chinese Library Classification (CLC) number:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Subject classification codes:

020208 ; 070103 ; 0714 ;

Abstract:

We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James-Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package "sda" available from the R repository CRAN.
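The cat score idea mentioned in the abstract can be illustrated for the two-class case: ordinary t-scores are decorrelated by multiplying with the inverse square root of the (shrunken) correlation matrix. The following is a minimal sketch under stated assumptions, not the implementation in the "sda" package. The function name `cat_scores` and the parameter `lam` are hypothetical, and the shrinkage intensity is fixed by hand here rather than chosen analytically as the abstract describes.

```python
import numpy as np

def cat_scores(X, y, lam=0.1):
    """Two-class correlation-adjusted t-scores (illustrative sketch).

    X   : (n, p) data matrix.
    y   : length-n array of 0/1 class labels.
    lam : fixed shrinkage intensity in (0, 1]; an assumption made for
          this sketch, not the analytic James-Stein choice.
    Returns (cat, t): the adjusted scores and the plain t-scores.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    n = n0 + n1

    # Center each sample on its own class mean to pool within-class scatter.
    resid = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
    pooled_var = (resid ** 2).sum(axis=0) / (n - 2)

    # Ordinary two-sample t-scores with pooled variance.
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    t = diff / np.sqrt(pooled_var * (1.0 / n0 + 1.0 / n1))

    # Pooled within-class correlation matrix, shrunk toward the identity.
    z = resid / np.sqrt(pooled_var)
    R = z.T @ z / (n - 2)
    R_shrunk = lam * np.eye(X.shape[1]) + (1.0 - lam) * R

    # Inverse matrix square root via the symmetric eigendecomposition.
    w, V = np.linalg.eigh(R_shrunk)
    R_inv_sqrt = (V / np.sqrt(w)) @ V.T

    return R_inv_sqrt @ t, t
```

With `lam=1.0` the correlation estimate collapses to the identity and the cat scores reduce to the plain t-scores, which is a convenient sanity check. Features would then be ranked by the magnitude of their cat score; the FNDR-based threshold from the abstract is not sketched here.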


Pages: 503-519

Page count: 17

References: 30 in total

[1] A general modular framework for gene set enrichment analysis [J]. Ackermann, Marit; Strimmer, Korbinian. BMC BIOINFORMATICS, 2009, 10

[2] Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling [J]. Alizadeh, AA; Eisen, MB; Davis, RE; Ma, C; Lossos, IS; Rosenwald, A; Boldrick, JG; Sabet, H; Tran, T; Yu, X; Powell, JI; Yang, LM; Marti, GE; Moore, T; Hudson, J; Lu, LS; Lewis, DB; Tibshirani, R; Sherlock, G; Chan, WC; Greiner, TC; Weisenburger, DD; Armitage, JO; Warnke, R; Levy, R; Wilson, W; Grever, MR; Byrd, JC; Botstein, D; Brown, PO; Staudt, LM. NATURE, 2000, 403 (6769): 503-511

[3] Selection bias in gene extraction on the basis of microarray gene-expression data [J]. Ambroise, C; McLachlan, GJ. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (10): 6562-6566

[4] Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations [J]. Bickel, PJ; Levina, E. BERNOULLI, 2004, 10 (06): 989-1010

[5] Optimality Driven Nearest Centroid Classification from Genomic Data [J]. Dabney, Alan R.; Storey, John D. PLOS ONE, 2007, 2 (10)

[6] Higher criticism thresholding: Optimal feature selection when useful features are rare and weak [J]. Donoho, David; Jin, Jiashun. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2008, 105 (39): 14790-14795

[7] Large-scale simultaneous hypothesis testing: The choice of a null hypothesis [J]. Efron, B. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2004, 99 (465): 96-104

[8] Efficiency of logistic regression compared to normal discriminant-analysis [J]. Efron, B. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1975, 70 (352): 892-898

[9] EFRON B, 2008, EMPIRICAL BAYES ESTI

[10] Microarrays, empirical Bayes and the two-groups model [J]. Efron, Bradley. STATISTICAL SCIENCE, 2008, 23 (01): 1-22
