SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests

Cited by: 45

Authors:

Wu, Qingyao [1]

Ye, Yunming [1]

Liu, Yang [2]

Ng, Michael K. [2]

Affiliations:

[1] Harbin Inst Technol, Shenzhen Grad Sch, Dept Comp Sci, Harbin, Peoples R China

[2] Hong Kong Baptist Univ, Dept Math, Hong Kong, Hong Kong, Peoples R China

Source:

IEEE TRANSACTIONS ON NANOBIOSCIENCE | 2012 / Vol. 11 / No. 3

Keywords:

Genome-wide association study; SNP; random forest; stratified sampling; VARIABLE IMPORTANCE; MISSING DATA; ASSOCIATION;

DOI:

10.1109/TNB.2012.2214232

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

In high-dimensional genome-wide association (GWA) case-control data for complex diseases, a large proportion of single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. Simple random sampling in a random forest, using the default parameter to choose the feature subspace, will select too many subspaces without informative SNPs. An exhaustive search for an optimal parameter value is therefore often required in order to include useful, relevant SNPs and discard the vast number of non-informative ones. However, such a search is too time-consuming to be practical for high-dimensional GWA data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for high-dimensional GWA data. Our idea is to design an equal-width discretization scheme for informativeness to divide SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace to generate a decision tree. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs, while avoiding the very high computational cost of an exhaustive search for an optimal parameter value and maintaining the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprising 408 803 SNPs and Alzheimer case-control data comprising 380 157 SNPs) to demonstrate that the proposed stratified sampling method is effective and that it generates a better random forest, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For the Parkinson data, we also show some interesting genes identified by the method, which may be associated with neurological disorders and merit further biological investigation.
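
The subspace selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how informativeness is measured or how sparse groups are handled, so the per-SNP `scores` input, the function names `equal_width_groups` and `stratified_subspace`, and the rule of taking every member of a group smaller than `per_group` are all assumptions made for illustration.

```python
import random


def equal_width_groups(scores, n_groups):
    """Assign each feature to one of n_groups equal-width bins of its score.

    `scores` is a hypothetical per-SNP informativeness value; the abstract
    does not specify which measure is used.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: a single group is the only sensible split.
        return [0] * len(scores)
    width = (hi - lo) / n_groups
    # Clamp so that the maximum score lands in the last group.
    return [min(int((s - lo) / width), n_groups - 1) for s in scores]


def stratified_subspace(groups, per_group, rng):
    """Draw the same number of feature indices from every non-empty group.

    Assumption: a group with fewer than per_group members contributes
    all of its members.
    """
    members = {}
    for idx, g in enumerate(groups):
        members.setdefault(g, []).append(idx)
    subspace = []
    for g in sorted(members):
        pool = members[g]
        subspace.extend(rng.sample(pool, min(per_group, len(pool))))
    return subspace


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [0.0, 0.1, 0.2, 0.5, 0.9, 1.0]  # toy informativeness values
    groups = equal_width_groups(scores, n_groups=2)
    # One subspace per tree; each would then be used to grow a decision tree.
    print(groups, stratified_subspace(groups, per_group=2, rng=rng))
```

Calling `stratified_subspace` once per tree with the same `groups` keeps the draw random while guaranteeing that every subspace includes features from the high-informativeness groups, which is the property the abstract attributes to the method.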

Pages: 216-227

Page count: 12

References: 33 in total

