Predictor correlation impacts machine learning algorithms: implications for genomic studies

Cited by: 128
Authors
Nicodemus, Kristin K. [1, 2, 3]
Malley, James D. [4]
Affiliations
[1] Univ Oxford, Dept Stat Genet, Wellcome Trust Ctr Human Genet, Oxford OX3 7BN, England
[2] Univ Oxford, Dept Clin Pharmacol, Oxford OX3 7DQ, England
[3] NIMH, Cognit & Psychosis Program, Intramural Res Program, Bethesda, MD 20892 USA
[4] NIH, Math & Stat Comp Lab, Div Computat Biosci, Ctr Informat Technol, Bethesda, MD 20892 USA
Funding
Wellcome Trust (UK); National Institutes of Health (USA)
Keywords
RANDOM FORESTS; HAPLOTYPE RECONSTRUCTION; EXPLOITING INTERACTIONS; VARIABLE IMPORTANCE; CLASSIFICATION; SNPS; SELECTION; TRAITS;
DOI
10.1093/bioinformatics/btp331
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: The advent of high-throughput genomics has produced studies with large numbers of predictors (e.g. genome-wide association and microarray studies). Machine learning algorithms (MLAs) are a computationally efficient way to identify phenotype-associated variables in high-dimensional data. Important results from mathematical theory and numerous practical results document their value. One attractive feature of MLAs is that many operate in a fully multivariate environment, allowing small-importance variables to be included when they act cooperatively. However, certain properties of MLAs under conditions common in genomic data have not been well studied; in particular, correlations among predictors pose a problem.

Results: Using extensive simulation, we showed that considering correlation among predictors is crucial for making valid inferences using variable importance measures (VIMs) from three MLAs: random forest (RF), conditional inference forest (CIF) and Monte Carlo logic regression (MCLR). Using a case-control illustration, we showed that the RF VIMs, even permutation-based ones, were less able to detect association than other algorithms at effect sizes encountered in complex disease studies. This reduction occurred when 'causal' predictors were correlated with other predictors, and was sharpest when RF tree building used the Gini index. Indeed, RF Gini VIMs are biased under correlation, depend on the strength of predictor correlation and the number of correlated predictors, and are over-trained to random fluctuations in the data when the tree terminal node size is small. Permutation-based VIM distributions were less variable for correlated predictors and are unbiased, and thus may be preferred when predictors are correlated. MLAs are a powerful tool for high-dimensional data analysis, but well-considered use of algorithms is necessary to draw valid conclusions.
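The two kinds of variable importance measure named in the abstract can be illustrated with a toy sketch. This is not the authors' simulation and it does not build a random forest: a single-split Gini decrease stands in for the Gini VIM, and a lookup-table classifier stands in for the fitted model when computing a permutation-based VIM. The data-generating settings (`flip`, `effect`, the sample size) and all function names are assumptions chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def simulate(n, flip=0.1, effect=0.8, seed=0):
    """Rows are (x0, x1, x2, y). x0 is the 'causal' predictor, x1 is a noisy
    copy of x0 (correlated, not causal), x2 is independent noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.randint(0, 1)
        x1 = x0 if rng.random() >= flip else 1 - x0
        x2 = rng.randint(0, 1)
        y = x0 if rng.random() < effect else 1 - x0
        rows.append((x0, x1, x2, y))
    return rows


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gini_decrease(rows, j):
    """Impurity decrease from one split on binary predictor j."""
    n = len(rows)
    left = [r[-1] for r in rows if r[j] == 0]
    right = [r[-1] for r in rows if r[j] == 1]
    parent = [r[-1] for r in rows]
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)


def fit_lookup(rows):
    """Majority-vote classifier over every observed predictor combination."""
    votes = defaultdict(Counter)
    for *x, y in rows:
        votes[tuple(x)][y] += 1
    table = {cell: c.most_common(1)[0][0] for cell, c in votes.items()}
    default = Counter(r[-1] for r in rows).most_common(1)[0][0]
    return lambda x: table.get(tuple(x), default)


def accuracy(model, rows):
    return sum(model(r[:-1]) == r[-1] for r in rows) / len(rows)


def permutation_importance(model, rows, j, seed=1):
    """Drop in accuracy after shuffling column j across rows."""
    rng = random.Random(seed)
    col = [r[j] for r in rows]
    rng.shuffle(col)
    permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, col)]
    return accuracy(model, rows) - accuracy(model, permuted)


train = simulate(4000, seed=0)
test = simulate(4000, seed=1)
model = fit_lookup(train)

gini_dec = [gini_decrease(train, j) for j in range(3)]
perm_imp = [permutation_importance(model, test, j) for j in range(3)]

for j in range(3):
    print(f"x{j}: Gini decrease = {gini_dec[j]:.3f}, "
          f"permutation importance = {perm_imp[j]:.3f}")
```

In this toy setting the non-causal predictor `x1` receives a sizeable marginal Gini decrease purely through its correlation with `x0`, whereas permuting it barely changes the accuracy of a model that already uses `x0`. A real analysis would use a forest implementation and the simulation design described in the paper.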
Pages: 1884-1890
Number of pages: 7