Data mining and machine learning approaches for the integration of genome-wide association and methylation data: methodology and main conclusions from GAW20

Cited by: 2
Authors
Darst, Burcu [1]
Engelman, Corinne D. [1]
Tian, Ye [2,3]
Bermejo, Justo Lorenzo [4]
Affiliations
[1] Univ Wisconsin, Sch Med & Publ Hlth, Dept Populat Hlth Sci, 610 Walnut St 1007 WARF, Madison, WI 53726 USA
[2] Univ Manitoba, Dept Biochem & Med Genet, 745 Bannatyne Ave, Winnipeg, MB R3E 0J9, Canada
[3] Univ Manitoba, Dept Elect & Comp Engn, 745 Bannatyne Ave, Winnipeg, MB R3E 0J9, Canada
[4] Heidelberg Univ, Inst Med Biometry & Informat, Neuenheimer Feld 130-3, D-69120 Heidelberg, Germany
Source
BMC GENETICS | 2018 / Vol. 19
Keywords
Data mining; Machine learning; Genome-wide association study; Epigenome-wide association study
DOI
10.1186/s12863-018-0646-3
Chinese Library Classification number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Background: Multiple layers of genetic and epigenetic variability are being simultaneously explored in an increasing number of health studies. We summarize here the different approaches applied in the Data Mining and Machine Learning group at GAW20 to integrate genome-wide genotype and methylation array data.
Results: We provide a non-intimidating introduction to some frequently used methods for investigating high-dimensional molecular data and compare the different approaches tried by group members: random forest, deep learning, cluster analysis, mixed models, and gene-set enrichment analysis. Group contributions were quite heterogeneous with regard to the data sets investigated (real vs simulated), the data quality control conducted, and the phenotypes assessed (e.g., metabolic syndrome vs relative differences of log-transformed triglyceride concentrations before and after fenofibrate treatment). However, some common technical issues were detected, leading to practical recommendations.
Conclusions: Group members identified different sources of correlation, including population stratification, family structure, batch effects, linkage disequilibrium, and correlation of methylation values at neighboring cytosine-phosphate-guanine (CpG) sites, and the majority of the applied approaches were able to take the identified correlation structures into account. The ability to deal efficiently with high-dimensional omics data and the model-free nature of the approaches, which did not require detailed model specifications, were clearly recognized as the main strengths of the applied methods. A limitation of random forest is its sensitivity to highly correlated variables. The parameter setup and the interpretation of results from deep learning methods, in particular deep neural networks, can be extremely challenging. Cluster analysis and mixed models may need some prior dimension reduction based on existing literature, data filtering, and supplementary statistical methods, and gene-set enrichment analysis requires biological insight.
Pages: 8