Gene-gene interaction: the curse of dimensionality

Cited by: 35
Authors
Chattopadhyay, Amrita [1 ]
Lu, Tzu-Pin [1 ]
Affiliations
[1] Natl Taiwan Univ, Inst Epidemiol & Prevent Med, Dept Publ Hlth, Taipei, Taiwan
Keywords
Gene-gene interaction; parallel computing; PySpark; deep-learning (DL); machine-learning (ML); multifactor dimensionality reduction (MDR); MISSING HERITABILITY; REDUCTION METHOD; NEURAL-NETWORKS; EPISTASIS;
DOI
10.21037/atm.2019.12.87
Chinese Library Classification number
R73 [Oncology];
Discipline classification code
100214 ;
Abstract
Genetic variants identified by genome-wide association studies frequently show only modest effects on disease risk, leading to the "missing heritability" problem. One avenue to account for part of this "missingness" is to evaluate gene-gene interactions (epistasis), thereby elucidating their effect on complex diseases. This can potentially help identify gene functions, pathways, and drug targets. However, the exhaustive evaluation of all possible genetic interactions among millions of single nucleotide polymorphisms (SNPs) raises several issues, otherwise known as the "curse of dimensionality". The number of candidate SNP combinations grows rapidly with the number of SNPs and the order of interaction, which diminishes the usefulness of traditional parametric statistical methods. Multifactor dimensionality reduction (MDR), a non-parametric method proposed in 2001 that pools multi-dimensional genotype combinations into a single binary (high-risk/low-risk) variable, became immensely popular and led to a fast-growing collection of MDR-based methods. Moreover, machine-learning (ML) methods such as random forests and neural networks (NNs), deep-learning (DL) approaches, and hybrid approaches have been applied extensively in recent years to tackle the dimensionality issue in whole-genome gene-gene interaction studies. However, exhaustive searching in MDR-based approaches and variable selection in ML methods still carry the risk of missing relevant SNPs. Furthermore, limited interpretability is a major hindrance for DL methods. To minimize this loss of information, Python-based tools such as PySpark can potentially take advantage of distributed computing resources in the cloud and bring back smaller subsets of data for further local analysis. Parallel computing can be a powerful resource for fighting this "curse". PySpark supports all standard Python libraries and C extensions, making it convenient to write code that delivers dramatic improvements in processing speed for extraordinarily large data sets.
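The two ideas the abstract leans on, the size of the search space and the MDR pooling step, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`n_interactions`, `mdr_pool`, `to_binary`), the 0/1/2 genotype coding, the toy data, and the tie rule (a cell whose case count equals `threshold` times its control count is treated as high-risk) are all assumptions made for illustration.

```python
from collections import Counter
from math import comb


def n_interactions(n_snps, order=2):
    """Number of distinct SNP combinations of the given interaction order."""
    return comb(n_snps, order)


def mdr_pool(genotypes, labels, threshold=1.0):
    """Return the set of genotype cells labelled high-risk.

    genotypes: list of tuples, one tuple of 0/1/2 codes per subject (assumed coding)
    labels:    list of 1 (case) or 0 (control), same length as genotypes
    threshold: case:control ratio at or above which a cell is high-risk
    """
    cases = Counter(g for g, y in zip(genotypes, labels) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, labels) if y == 0)
    high_risk = set()
    for cell in set(cases) | set(controls):
        # Compare counts by multiplication so empty control cells need no division.
        if cases[cell] >= threshold * controls[cell]:
            high_risk.add(cell)
    return high_risk


def to_binary(genotypes, high_risk):
    """Collapse each multi-locus genotype to 1 (high-risk) or 0 (low-risk)."""
    return [1 if g in high_risk else 0 for g in genotypes]


# Hypothetical toy data: eight subjects genotyped at two SNPs.
toy_genotypes = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (2, 0), (2, 0)]
toy_labels = [1, 0, 0, 1, 1, 0, 0, 0]

toy_high_risk = mdr_pool(toy_genotypes, toy_labels)
toy_binary = to_binary(toy_genotypes, toy_high_risk)
pairs_for_one_million_snps = n_interactions(1_000_000, 2)
```

With one million SNPs, `n_interactions` gives roughly 5 × 10^11 pairs for two-way interactions alone, which is the scale at which an exhaustive scan becomes a candidate for distributed execution. A full MDR analysis would also add cross-validation and a classification-error measure, both omitted here.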
Pages: 5