Maximal conditional chi-square importance in random forests

Cited by: 28
Authors
Wang, Minghui [1]
Chen, Xiang [1]
Zhang, Heping [1]
Affiliations
[1] Yale Univ, Sch Med, Dept Epidemiol & Publ Hlth, New Haven, CT 06520 USA
Funding
National Institutes of Health (USA);
Keywords
GENETICALLY COMPLEX TRAITS; FACTOR-H POLYMORPHISM; ASSOCIATION ANALYSIS; LINKAGE STRATEGIES; GENE POLYMORPHISMS; CLASSIFICATION; RISK; HAPLOTYPES; EXPRESSION; SELECTION;
DOI
10.1093/bioinformatics/btq038
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features, such as single nucleotide polymorphisms (SNPs) in GWAS, that are associated with a disease. Random forests are a very useful approach for this purpose because they provide a variable importance score. However, this importance score has several shortcomings. We propose an alternative importance measure to overcome those shortcomings.

Results: We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses the maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical P-values of SNPs. Our method was compared with a univariate test and with permutation tests based on the Gini and permutation importance measures. In simulations, the proposed method consistently outperformed the other methods in identifying risk SNPs. In a GWAS of age-related macular degeneration, the proposed method confirmed two significant SNPs (at the genome-wide adjusted level of 0.05). Further analysis showed that these two SNPs conformed to a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs because it uses conditional information on different SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates understanding of the etiological relationship between genetic variants and complex diseases.
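The abstract describes two ingredients: a chi-square measure of association between a SNP and the trait conditional on other SNPs, and a permutation test that turns that measure into an empirical P-value. The sketch below illustrates the general idea only and is not the authors' implementation. It assumes, for illustration, that "conditional on other SNPs" means stratifying the sample by the genotype of one other SNP, whereas the paper computes its measure inside random forests. The function names, the `min_size` threshold, and the permutation count are all hypothetical.

```python
import random
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    stat = 0.0
    for x, rx in row.items():
        for y, cy in col.items():
            expected = rx * cy / n  # always > 0 because rx, cy >= 1
            stat += (joint.get((x, y), 0) - expected) ** 2 / expected
    return stat


def max_conditional_chi_square(genotypes, trait, j, min_size=10):
    """Largest chi-square between SNP j and the trait, taken over the full
    sample and over strata defined by the genotype of one other SNP.

    Assumption: this stratified search stands in for the forest-based
    conditioning described in the abstract.
    """
    snp = [g[j] for g in genotypes]
    best = chi_square(snp, trait)
    for k in range(len(genotypes[0])):
        if k == j:
            continue
        for level in set(g[k] for g in genotypes):
            idx = [i for i, g in enumerate(genotypes) if g[k] == level]
            if len(idx) < min_size:  # skip strata too small to be informative
                continue
            stat = chi_square([snp[i] for i in idx], [trait[i] for i in idx])
            best = max(best, stat)
    return best


def empirical_p_value(genotypes, trait, j, n_perm=200, seed=0):
    """Permutation P-value: share of trait shuffles whose score reaches the
    observed score, with the usual +1 correction."""
    rng = random.Random(seed)
    observed = max_conditional_chi_square(genotypes, trait, j)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_conditional_chi_square(genotypes, shuffled, j) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `genotypes` is a list of per-sample genotype lists (for example, values 0, 1, 2 per SNP) and `trait` is a list of case/control labels. Shuffling the trait breaks any SNP-trait association while preserving the correlation among SNPs, which is why the permuted scores can serve as a null reference for the observed one.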
Pages: 831-837
Number of pages: 7