High-Dimensional Variable Selection for Survival Data

Cited by: 325
Authors
Ishwaran, Hemant [1]
Kogalur, Udaya B. [1]
Gorodeski, Eiran Z. [2]
Minn, Andy J. [3]
Lauer, Michael S. [4]
Affiliations
[1] Cleveland Clin, Dept Quantitat Hlth Sci, Cleveland, OH 44195 USA
[2] Cleveland Clin, Inst Heart & Vasc, Cleveland, OH 44195 USA
[3] Univ Penn, Dept Radiat Oncol, Philadelphia, PA 19104 USA
[4] NHLBI, Div Cardiovasc Sci, Bethesda, MD 20892 USA
Keywords
Forest; Maximal subtree; Minimal depth; Random survival forest; Tree; VIMP; GENE-EXPRESSION PROFILES; PREDICT SURVIVAL; CLASSIFICATION; REGRESSION; CHEMOTHERAPY; SIGNATURE; CANCER; MODEL
DOI
10.1198/jasa.2009.tm08622
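Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714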
Abstract
The minimal depth of a maximal subtree is a dimensionless order statistic measuring the predictiveness of a variable in a survival tree. We derive the distribution of the minimal depth and use it for high-dimensional variable selection using random survival forests. In big p and small n problems (where p is the dimension and n is the sample size), the distribution of the minimal depth reveals a "ceiling effect" in which a tree simply cannot be grown deep enough to properly identify predictive variables. Motivated by this limitation, we develop a new regularized algorithm, termed RSF-Variable Hunting. This algorithm exploits maximal subtrees for effective variable selection under such scenarios. Several applications are presented demonstrating the methodology, including the problem of gene selection using microarray data. In this work we focus only on survival settings, although our methodology also applies to other random forests applications, including regression and classification settings. All examples presented here use the R-software package randomSurvivalForest.
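As a rough illustration of the central quantity in the abstract, the sketch below computes the minimal depth of each variable in a single toy tree: the depth of the shallowest node that splits on that variable, with the root at depth 0. It is a minimal sketch in Python and is not taken from the paper or from the randomSurvivalForest package. The `Node` class, the `minimal_depths` function, and the variable names `age`, `gene_7`, and `stage` are all hypothetical. Variables that never split are simply left out of the result, which is a simplifying assumption made here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical binary tree node; split_var=None marks a terminal node."""
    split_var: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def minimal_depths(root: Node) -> Dict[str, int]:
    """Return, for each splitting variable, the depth of its shallowest split.

    The root is at depth 0. Variables that never split do not appear
    in the returned dictionary (an assumption of this sketch).
    """
    depths: Dict[str, int] = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node is None or node.split_var is None:
            continue  # terminal node: nothing to record
        var = node.split_var
        if var not in depths or depth < depths[var]:
            depths[var] = depth
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return depths


# Toy tree: "age" splits at the root and again at depth 2.
toy_tree = Node(
    "age",
    left=Node(
        "gene_7",
        left=Node(),
        right=Node("age", left=Node(), right=Node()),
    ),
    right=Node("stage", left=Node(), right=Node()),
)

result = minimal_depths(toy_tree)
print(result)
```

In this toy tree, `age` has the smallest minimal depth because it splits the root, so it would rank as the most predictive variable under this measure. In a forest, one would typically average such depths over many trees; that step is omitted here.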
Pages: 205-217
Number of pages: 13