CLASSIFICATION OF LARGE MICROARRAY DATASETS USING FAST RANDOM FOREST CONSTRUCTION

Cited by: 9
Authors
Manilich, Elena A. [1, 2]
Oezsoyoglu, Z. Meral [1]
Trubachev, Valeriy [2]
Radivoyevitch, Tomas [3]
Affiliations
[1] Case Western Reserve Univ, Dept Comp Sci, Cleveland, OH 44106 USA
[2] Cleveland Clin, Inst Digest Dis, Cleveland, OH 44195 USA
[3] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA
Keywords
Algorithm; data mining; genomic; classifier; random forest; ensemble algorithm; optimize; bootstrap samples; machine learning; microarray; analysis; gene expression; file-based implementation; multi-dimensional data; high-dimensional data; CELL; EXPRESSION; PREDICTION; SURVIVAL
DOI
10.1142/S021972001100546X
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification code
071010; 081704
Abstract
Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high-dimensional microarray datasets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets, as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.
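The two ingredients the abstract names, bootstrap samples and restricted attribute subsets, can be illustrated with a minimal sketch. This is not the paper's file-based, optimized implementation, which the abstract does not detail. It is a toy in-memory version that assumes one-level trees (stumps) instead of full decision trees, and it uses split counts as a crude stand-in for variable importance. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def best_stump(rows, labels, features):
    """Best one-level tree over the given feature indices.

    Returns ("split", feature, threshold, left_label, right_label),
    or ("leaf", label) when no threshold separates the rows.
    """
    best, best_score = None, float("inf")
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not right:
                continue  # a threshold at the maximum value splits nothing
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score = score
                best = ("split", f, t, majority(left), majority(right))
    return best if best is not None else ("leaf", majority(labels))


def train_forest(rows, labels, n_trees, n_features, seed=0):
    """Each tree sees a bootstrap sample and a random subset of attributes."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: draw n rows with replacement
        feats = rng.sample(range(p), n_features)     # restricted attribute subset
        forest.append(best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all trees in the forest."""
    votes = []
    for tree in forest:
        if tree[0] == "leaf":
            votes.append(tree[1])
        else:
            _, f, t, left_label, right_label = tree
            votes.append(left_label if row[f] <= t else right_label)
    return majority(votes)


def split_counts(forest):
    """Crude importance proxy (an assumption of this sketch): splits per feature."""
    return Counter(tree[1] for tree in forest if tree[0] == "split")


if __name__ == "__main__":
    rng = random.Random(1)
    # Made-up toy data: feature 0 carries the class signal, features 1-3 are noise.
    labels = [i % 2 for i in range(20)]
    rows = [[y * 2.0 + rng.random()] + [rng.random() for _ in range(3)] for y in labels]
    forest = train_forest(rows, labels, n_trees=25, n_features=2)
    print(predict(forest, [0.5, 0.5, 0.5, 0.5]), split_counts(forest))
```

Because this sketch holds every row in memory, it has exactly the limitation the abstract attributes to existing implementations; it only shows how the sampling and voting fit together.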
Pages: 251-267
Number of pages: 17