Data mining and machine learning approaches for the integration of genome-wide association and methylation data: methodology and main conclusions from GAW20

Cited by: 2

Authors:

Darst, Burcu [1]

Engelman, Corinne D. [1]

Tian, Ye [2,3]

Bermejo, Justo Lorenzo [4]

Affiliations:

[1] Univ Wisconsin, Sch Med & Publ Hlth, Dept Populat Hlth Sci, 610 Walnut St 1007 WARF, Madison, WI 53726 USA

[2] Univ Manitoba, Dept Biochem & Med Genet, 745 Bannatyne Ave, Winnipeg, MB R3E 0J9, Canada

[3] Univ Manitoba, Dept Elect & Comp Engn, 745 Bannatyne Ave, Winnipeg, MB R3E 0J9, Canada

[4] Heidelberg Univ, Inst Med Biometry & Informat, Neuenheimer Feld 130-3, D-69120 Heidelberg, Germany

Source:

BMC GENETICS | 2018 / Vol. 19

Keywords:

Data mining; Machine learning; Genome-wide association study; Epigenome-wide association study;

DOI:

10.1186/s12863-018-0646-3

Chinese Library Classification:

Q3 [Genetics];

Discipline classification codes:

071007 ; 090102 ;

Abstract:

Background: Multiple layers of genetic and epigenetic variability are being explored simultaneously in an increasing number of health studies. We summarize here the different approaches applied by the Data Mining and Machine Learning group at GAW20 to integrate genome-wide genotype and methylation array data.

Results: We provide a non-intimidating introduction to some frequently used methods for investigating high-dimensional molecular data and compare the approaches tried by group members: random forest, deep learning, cluster analysis, mixed models, and gene-set enrichment analysis. Group contributions were quite heterogeneous with regard to the data sets investigated (real vs. simulated), the data quality control conducted, and the phenotypes assessed (e.g., metabolic syndrome vs. relative differences of log-transformed triglyceride concentrations before and after fenofibrate treatment). However, some common technical issues were detected, leading to practical recommendations.

Conclusions: Group members identified different sources of correlation, including population stratification, family structure, batch effects, linkage disequilibrium, and correlation of methylation values at neighboring cytosine-phosphate-guanine (CpG) sites, and the majority of the applied approaches were able to take the identified correlation structures into account. The ability to deal efficiently with high-dimensional omics data and the model-free nature of the approaches, which did not require detailed model specifications, were clearly recognized as the main strengths of the applied methods. A limitation of random forest is its sensitivity to highly correlated variables. The parameter setup and the interpretation of results from deep learning methods, in particular deep neural networks, can be extremely challenging. Cluster analysis and mixed models may need some prior dimension reduction based on existing literature, data filtering, and supplementary statistical methods, and gene-set enrichment analysis requires biological insight.

Pages: 8

29 items in total

[1] Data mining and machine learning approaches for the integration of genome-wide association and methylation data: methodology and main conclusions from GAW20
Burcu Darst
Corinne D. Engelman
Ye Tian
Justo Lorenzo Bermejo
BMC Genetics, 19
[2] Data mining approaches for genome-wide association of mood disorders
Pirooznia, Mehdi
Seifuddin, Fayaz
Judy, Jennifer
Mahon, Pamela B.
Potash, James B.
Zandi, Peter P.
PSYCHIATRIC GENETICS, 2012, 22 (02) : 55 - 61
[3] Machine learning approaches to genome-wide association studies
Enoma, David O.
Bishung, Janet
Abiodun, Theresa
Ogunlana, Olubanke
Osamor, Victor Chukwudi
JOURNAL OF KING SAUD UNIVERSITY SCIENCE, 2022, 34 (04)
[4] Polygenic modelling and machine learning approaches in pharmacogenomics: Importance in downstream analysis of genome-wide association study data
Koido, Masaru
BRITISH JOURNAL OF CLINICAL PHARMACOLOGY, 2023,
[5] Genome-Wide EST Data Mining Approaches to Resolving Incongruence of Molecular Phylogenies
Shan, Yunfeng
Gras, Robin
ADVANCES IN COMPUTATIONAL BIOLOGY, 2010, 680 : 237 - 243
[6] Revisiting genome-wide association studies from statistical modelling to machine learning
Sun, Shanwen
Dong, Benzhi
Zou, Quan
BRIEFINGS IN BIOINFORMATICS, 2021, 22 (04)
[7] Evaluation of methodology for the analysis of 'time-to-event' data in pharmacogenomic genome-wide association studies
Syed, Hamzah
Jorgensen, Andrea L.
Morris, Andrew P.
PHARMACOGENOMICS, 2016, 17 (08) : 907 - 915
[8] Genetic Architecture of Lung Cancer Using Machine-Learning Approaches in Genome-Wide Association Studies
Byun, J.
Han, Y.
Edelson, J.
Ostrom, Q.
Amos, C.
JOURNAL OF THORACIC ONCOLOGY, 2019, 14 (10) : S516 - S517
[9] Identification of novel therapeutics for complex diseases from genome-wide association data
Mani P Grover
Sara Ballouz
Kaavya A Mohanasundaram
Richard A George
Craig D H Sherman
Tamsyn M Crowley
Merridee A Wouters
BMC Medical Genomics, 7
[10] Genome-wide prediction and prioritization of human aging genes by data fusion: a machine learning approach
Arabfard, Masoud
Ohadi, Mina
Tabar, Vahid Rezaei
Delbari, Ahmad
Kavousi, Kaveh
BMC GENOMICS, 2019, 20 (01)