Why classification models using array gene expression data perform so well: A preliminary investigation of explanatory factors

Cited by: 0
Authors
Aliferis, CF [1 ]
Tsamardinos, I [1 ]
Massion, P [1 ]
Statnikov, AR [1 ]
Hardin, D [1 ]
Affiliations
[1] Vanderbilt Univ, Dept Biomed Informat, Discovery Syst Lab, Nashville, TN 37232 USA
Source
METMBS'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MATHEMATICS AND ENGINEERING TECHNIQUES IN MEDICINE AND BIOLOGICAL SCIENCES | 2003
Keywords
bioinformatics and medicine; gene expression; expression data analysis;
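DOI
Not available
Chinese Library Classification number
R19 [Health care organizations and services (health services management)];
Subject classification code
Abstract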
Published results for classification models built from microarray data often appear exceedingly good relative to most other domains of machine learning and clinical diagnostics. Yet array data are noisy and have very small sample-to-variable ratios. What explains such exemplary, yet counter-intuitive, classification performance? Answering this question has significant implications (a) for the broad acceptance of such models by the medical and biostatistical community, and (b) for gaining valuable insight into the properties of this domain. To address this problem, we build several models for three classification tasks in a gene expression array dataset with 12,600 oligonucleotides and 203 patient cases. We then study the effects of classifier type (kernel-based/non-kernel-based, linear/non-linear), sample size, sample selection within cross-validation, and gene information redundancy. Our analyses show that gene redundancy and classifier choice have the strongest effects on performance. Linear bias in the classifiers and sample size (as long as kernel classifiers are used) have relatively small effects; the train-test sample ratio and the choice of cross-validation sample selection method appear to have small-to-negligible effects.
Pages: 47-53
Page count: 7