Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions

Cited by: 222

Authors:

Somorjai, RL [1]

Dolenko, B [1]

Baumgartner, R [1]

Affiliations:

[1] Natl Res Council Canada, Inst Biodiagnost, Winnipeg, MB R3B 1Y6, Canada

Source:

BIOINFORMATICS | 2003 / Vol. 19 / Issue 12

DOI:

10.1093/bioinformatics/btg182

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]

Discipline classification codes:

071010; 081704

Abstract:

Motivation: Two practical realities constrain the analysis of microarray data, mass spectra from proteomics, and biomedical infrared or magnetic resonance spectra. One is the 'curse of dimensionality': the number of features characterizing these data is in the thousands or tens of thousands. The other is the 'curse of dataset sparsity': the number of samples is limited. The consequences of these two curses are far-reaching when such data are used to classify the presence or absence of disease. Results: Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample-per-feature ratio is increased to the recommended 5-10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several 'optimal' feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the training and independent validation sets. This non-uniqueness leads to interpretational difficulties and casts doubt on the biological relevance of any of these 'optimal' feature sets. We suggest an approach to assess the relative quality of apparently equally good classifiers.
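The non-uniqueness described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it is a minimal illustration under assumed, hypothetical sizes (6 samples per class, 5000 features) using pure Gaussian noise, so no feature carries any real class information. It counts how many single features nevertheless split the two classes perfectly with one threshold. For continuous noise, the chance that one feature does this is 2 / C(12, 6) = 2 / 924, so roughly 11 of the 5000 noise features are expected to look like perfect discriminators on the training samples alone. The function name `perfectly_separating_features` is introduced here for illustration only.

```python
import random


def perfectly_separating_features(X, y):
    """Return indices of features where one threshold splits class 0 from class 1.

    X is a list of samples (each a list of feature values); y holds labels 0 or 1.
    """
    hits = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        # Perfect separation: every class-0 value lies below every class-1
        # value, or the other way round.
        if max(a) < min(b) or max(b) < min(a):
            hits.append(j)
    return hits


# Hypothetical sizes chosen for illustration (not taken from the paper).
n_per_class, n_features = 6, 5000
rng = random.Random(0)  # fixed seed so the run is repeatable

y = [0] * n_per_class + [1] * n_per_class
# Pure noise: the labels have no relationship to any feature.
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in y]

hits = perfectly_separating_features(X, y)
print(f"samples per feature: {len(y) / n_features:.4f}")
print(f"noise features with perfect training separation: {len(hits)}")
```

This sketch only examines the training samples; the abstract's stronger observation, that several feature sets can also classify an independent validation set perfectly, would need a held-out split to reproduce.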

Citation

Pages: 1484-1491

Page count: 8

