FEATURE SELECTION IN OMICS PREDICTION PROBLEMS USING CAT SCORES AND FALSE NONDISCOVERY RATE CONTROL

Cited by: 92
Authors
Ahdesmaeki, Miika [1, 2]
Strimmer, Korbinian [1 ]
Affiliations
[1] Univ Leipzig, IMISE, D-04107 Leipzig, Germany
[2] Tampere Univ Technol, Dept Signal Proc, FI-33101 Tampere, Finland
Keywords
Feature selection; linear discriminant analysis; correlation; James-Stein estimator; "small n, large p" setting; correlation-adjusted t-score; false discovery rates; higher criticism; LINEAR DISCRIMINANT-ANALYSIS; SHRUNKEN CENTROIDS; CLASSIFICATION; REGRESSION; DISCOVERY; RANKING; BAYES
DOI
10.1214/09-AOAS277
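Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract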
We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James-Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package "sda" available from the R repository CRAN.
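The first and third steps of the abstract can be illustrated with a minimal sketch. It assumes the usual definition of cat scores as the vector of t-scores premultiplied by the inverse matrix square root of the correlation matrix, and a shrinkage correlation estimate of the form lam*I + (1 - lam)*R. All function names below are hypothetical and are not the API of the "sda" package. The shrinkage intensity `lam` is passed in by hand here, whereas the paper chooses it analytically, and the Newton-Schulz iteration is only one convenient way to obtain the inverse square root.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def identity(p):
    """p x p identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]


def shrink_correlation(r, lam):
    """Shrink a correlation matrix toward the identity: lam*I + (1-lam)*r.

    Assumption: lam in [0, 1] is supplied by the caller rather than estimated.
    The diagonal stays at 1; off-diagonal entries are scaled by (1 - lam).
    """
    p = len(r)
    return [[1.0 if i == j else (1.0 - lam) * r[i][j] for j in range(p)]
            for i in range(p)]


def inv_sqrt(a, iters=50):
    """Inverse square root of a symmetric positive definite matrix.

    Uses the coupled Newton-Schulz iteration, which needs only matrix
    products. The matrix is first divided by its trace so that all
    eigenvalues lie in (0, 1], which guarantees convergence.
    """
    p = len(a)
    c = sum(a[i][i] for i in range(p))
    y = [[v / c for v in row] for row in a]
    z = identity(p)
    for _ in range(iters):
        zy = matmul(z, y)
        t = [[(3.0 * (i == j) - zy[i][j]) / 2.0 for j in range(p)]
             for i in range(p)]
        y = matmul(y, t)
        z = matmul(t, z)
    # z approximates (a / c)^(-1/2), so undo the scaling.
    return [[v / math.sqrt(c) for v in row] for row in z]


def cat_scores(t, r):
    """Decorrelate t-scores: returns r^(-1/2) applied to the vector t."""
    w = inv_sqrt(r)
    return [sum(w_ij * t_j for w_ij, t_j in zip(row, t)) for row in w]


if __name__ == "__main__":
    # Two strongly correlated features with identical t-scores (toy values).
    r = shrink_correlation([[1.0, 0.9], [0.9, 1.0]], lam=0.2)
    scores = cat_scores([3.0, 3.0], r)
    # Rank features by absolute cat score, largest first.
    ranking = sorted(range(len(scores)), key=lambda j: -abs(scores[j]))
    print(scores, ranking)
```

When the correlation matrix is the identity, the cat scores reduce to the ordinary t-scores, so the sketch only changes the ranking when features are correlated. Thresholding the ranked scores by FNDR control, the second step of the abstract, is not covered here.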
Pages: 503-519
Number of pages: 17