Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations

Cited by: 97
Authors
Boulesteix, Anne-Laure
Bender, Andreas
Bermejo, Justo Lorenzo [2]
Strobl, Carolin [1]
Affiliations
[1] LMU Univ Munich, Dept Stat, Munich, Germany
[2] Univ Heidelberg Hosp, Grp Stat Genet, Inst Med Biometry & Informat, Heidelberg, Germany
Keywords
random forest; genetic association study; variable importance; variable selection bias; CART; cforest
DOI
10.1093/bib/bbr053
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered an important pitfall, namely that common variants are systematically favoured by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present article is threefold: (i) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation based); (ii) to explore the trees and to compare the behaviour of random forests and the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants; and (iii) to summarize these results and previously investigated properties of random forest VIMs in the context of genetic association studies and to make practical recommendations regarding the choice of the random forest and variable importance type. All our analyses can be reproduced using R code available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ginibias/.
Pages: 292-304
Number of pages: 13
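
The preference of the Gini criterion for common variants described in the abstract can be illustrated with a small null simulation. The sketch below is not the authors' R code from the companion website. It is a simplified, single-node illustration in Python under stated assumptions: genotypes coded 0/1/2 are drawn under Hardy-Weinberg equilibrium, the binary phenotype is independent of every SNP with a case probability of 0.5, and only the two ordinal cut points of a SNP are considered. All function names and default values (`gini`, `best_gini_decrease`, `mean_null_decrease`, `n=100`, `reps=2000`) are hypothetical and chosen for illustration only.

```python
import random


def gini(pos, n):
    """Gini impurity 2*p*(1-p) of a node with `pos` cases among `n` samples."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(genotypes, labels):
    """Largest Gini impurity decrease over the two cut points of a 0/1/2 SNP."""
    n = len(labels)
    parent = gini(sum(labels), n)
    best = 0.0
    for cut in (0, 1):  # left child holds genotypes <= cut
        left = [y for g, y in zip(genotypes, labels) if g <= cut]
        right = [y for g, y in zip(genotypes, labels) if g > cut]
        if not left or not right:
            continue  # this cut point does not separate any samples
        child = (len(left) * gini(sum(left), len(left))
                 + len(right) * gini(sum(right), len(right))) / n
        best = max(best, parent - child)
    return best


def mean_null_decrease(maf, n=100, reps=2000, seed=1):
    """Average best-split Gini decrease for a SNP unrelated to the phenotype."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # Assumption: Hardy-Weinberg equilibrium, i.e. two independent alleles.
        genotypes = [(rng.random() < maf) + (rng.random() < maf)
                     for _ in range(n)]
        # Assumption: phenotype is pure noise with 50% cases.
        labels = [int(rng.random() < 0.5) for _ in range(n)]
        total += best_gini_decrease(genotypes, labels)
    return total / reps


if __name__ == "__main__":
    for maf in (0.05, 0.10, 0.25, 0.50):
        mean_gain = mean_null_decrease(maf)
        print(f"MAF {maf:.2f}: mean best-split Gini decrease = {mean_gain:.5f}")
```

In this sketch no SNP carries any signal, so any systematic difference in the average best-split decrease reflects selection bias alone. One plausible contributor, under these assumptions, is that a rare variant often offers only one usable cut point in a node, whereas a common variant offers two, so the maximum is taken over more candidate splits. The bias in a full forest also depends on tree depth and the other settings studied in the article, none of which this sketch reproduces.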