Detecting gene-gene interactions using a permutation-based random forest method

Cited by: 54
Authors
Li, Jing [1]
Malley, James D. [2]
Andrew, Angeline S. [3]
Karagas, Margaret R. [3]
Moore, Jason H. [4, 5]
Affiliations
[1] Dartmouth Coll, Geisel Sch Med, Dept Genet, Hanover, NH 03755 USA
[2] NIH, Div Computat Biosci, Ctr Informat Technol, Bldg 10, Bethesda, MD 20892 USA
[3] Dartmouth Coll, Geisel Sch Med, Dept Epidemiol, Hanover, NH 03755 USA
[4] Univ Penn, Inst Biomed Informat, Philadelphia, PA 19104 USA
[5] Univ Penn, Perelman Sch Med, Dept Biostat & Epidemiol, Philadelphia, PA 19104 USA
Source
BIODATA MINING | 2016 / Vol. 9
Funding
National Institutes of Health (USA);
Keywords
Random forest; GWAS; Machine learning; Scale invariant; MULTIFACTOR DIMENSIONALITY REDUCTION; GENOME-WIDE ASSOCIATION; EPISTATIC MODELS; RISK; DISEASE; STRICT; PURE; SNPS;
DOI
10.1186/s13040-016-0093-5
Chinese Library Classification (CLC) number
Q [Biological sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Background: Identifying gene-gene interactions is essential for understanding disease susceptibility and for detecting the genetic architectures underlying complex diseases. Here, we aimed to develop a permutation-based methodology that relies on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach, called permuted random forest (pRF), identifies the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the power of a random forest classification model is influenced by removing pairwise interactions. Results: We systematically tested our approach in a simulation study with datasets that varied in genetic constraints, including heritability, number of SNPs, and sample size. Our methodology showed high success rates for detecting the interacting SNP pair. We also applied our approach to two bladder cancer datasets, where it produced results consistent with well-studied methodologies such as multifactor dimensionality reduction (MDR) and statistical epistasis network (SEN). Furthermore, we built permuted random forest networks (PRFN), in which nodes represent SNPs and edges indicate interactions. Conclusions: We successfully developed a scale-invariant methodology to detect pure gene-gene interactions based on permutation strategies and the machine learning method random forest. This methodology shows great potential for detecting gene-gene interactions and studying the underlying genetic architectures in a scale-free way, which could help uncover the mechanisms of complex diseases.
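The core idea in the abstract, scoring a SNP pair by how much classification performance falls once that pair's interaction is removed, can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not spell out the permutation scheme, so the within-class shuffle below is an assumption, and a simple joint-genotype lookup classifier restricted to the two SNPs stands in for the random forest so that the example needs only the Python standard library. The toy data generator and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def simulate_xor(n, rng):
    """Toy data (hypothetical): SNPs 0 and 1 interact purely (case = XOR); SNP 2 is noise."""
    rows = [[rng.randint(0, 1) for _ in range(3)] for _ in range(n)]
    labels = [r[0] ^ r[1] for r in rows]
    return rows, labels


def permute_within_class(rows, labels, col, rng):
    """Shuffle one SNP column separately inside each class (assumed scheme).

    The SNP's per-class genotype counts (its main effect) are unchanged,
    but its joint distribution with the other SNPs is broken.
    """
    out = [r[:] for r in rows]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        vals = [rows[i][col] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i][col] = v
    return out


def pair_accuracy(rows, labels, i, j):
    """Hold-out accuracy of a majority-vote lookup on the joint genotype of SNPs i and j."""
    half = len(rows) // 2
    cells = defaultdict(Counter)
    for r, y in zip(rows[:half], labels[:half]):
        cells[(r[i], r[j])][y] += 1
    default = Counter(labels[:half]).most_common(1)[0][0]
    hits = 0
    for r, y in zip(rows[half:], labels[half:]):
        votes = cells.get((r[i], r[j]))
        pred = votes.most_common(1)[0][0] if votes else default
        hits += pred == y
    return hits / (len(rows) - half)


def interaction_scores(rows, labels, rng):
    """For every pair (i, j): accuracy lost when SNP j is permuted within class."""
    scores = {}
    for i, j in combinations(range(len(rows[0])), 2):
        permuted = permute_within_class(rows, labels, j, rng)
        before = pair_accuracy(rows, labels, i, j)
        after = pair_accuracy(permuted, labels, i, j)
        scores[(i, j)] = before - after
    return scores


rng = random.Random(0)
rows, labels = simulate_xor(2000, rng)
scores = interaction_scores(rows, labels, rng)
for pair, drop in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(drop, 3))
```

In this toy setting the pair (0, 1) should show a large accuracy drop, while pairs involving the noise SNP should stay near zero. A closer analogue of pRF would replace `pair_accuracy` with a random forest trained on all SNPs, which this sketch does not attempt.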
Pages: 17
Related papers
41 in total
[31]   Epistasis and Its Implications for Personal Genetics [J].
Moore, Jason H. ;
Williams, Scott M. .
AMERICAN JOURNAL OF HUMAN GENETICS, 2009, 85 (03) :309-320
[32]   Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity [J].
Ritchie, MD ;
Hahn, LW ;
Moore, JH .
GENETIC EPIDEMIOLOGY, 2003, 24 (02) :150-157
[33]   On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data [J].
Schwarz, Daniel F. ;
Koenig, Inke R. ;
Ziegler, Andreas .
BIOINFORMATICS, 2010, 26 (14) :1752-1758
[34]   Spitz MR, 2001, CANCER RES, V61, P1354
[35]   Transcription onset of genes critical in liver carcinogenesis is epigenetically regulated by methylated DNA-binding protein MBD2 [J].
Stefanska, Barbara ;
Suderman, Matthew ;
Machnes, Ziv ;
Bhattacharyya, Bishnu ;
Hallett, Michael ;
Szyf, Moshe .
CARCINOGENESIS, 2013, 34 (12) :2738-2749
[36]   GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures [J].
Urbanowicz, Ryan J. ;
Kiralis, Jeff ;
Sinnott-Armstrong, Nicholas A. ;
Heberling, Tamra ;
Fisher, Jonathan M. ;
Moore, Jason H. .
BIODATA MINING, 2012, 5
[37]   Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection [J].
Urbanowicz, Ryan J. ;
Kiralis, Jeff ;
Fisher, Jonathan M. ;
Moore, Jason H. .
BIODATA MINING, 2012, 5
[38]   A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction [J].
Velez, Digna R. ;
White, Bill C. ;
Motsinger, Alison A. ;
Bush, William S. ;
Ritchie, Marylyn D. ;
Williams, Scott M. ;
Moore, Jason H. .
GENETIC EPIDEMIOLOGY, 2007, 31 (04) :306-315
[39]   Genome-wide association studies: Theoretical and practical concerns [J].
Wang, WYS ;
Barratt, BJ ;
Clayton, DG ;
Todd, JA .
NATURE REVIEWS GENETICS, 2005, 6 (02) :109-118
[40]   SNP interaction detection with Random Forests in high-dimensional genetic data [J].
Winham, Stacey J. ;
Colby, Colin L. ;
Freimuth, Robert R. ;
Wang, Xin ;
de Andrade, Mariza ;
Huebner, Marianne ;
Biernacka, Joanna M. .
BMC BIOINFORMATICS, 2012, 13