Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data

Cited by: 2
Authors
Sugolov, Anton [1]
Emmenegger, Eric [2]
Paterson, Andrew D. [3, 4]
Sun, Lei [4, 5]
Affiliations
[1] Univ Toronto, Fac Arts & Sci, Dept Math, Toronto, ON, Canada
[2] Univ Toronto, Dept Mech & Ind Engn, Toronto, ON, Canada
[3] Univ Toronto, Hosp Sick Children, Program Genet & Genome Biol, Toronto, ON, Canada
[4] Univ Toronto, Dalla Lana Sch Publ Hlth, Toronto, ON, Canada
[5] Univ Toronto, Fac Arts & Sci, Dalla Lana Sch Publ Hlth, Dept Stat Sci, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada; Canadian Institutes of Health Research;
Keywords
1000 Genomes Project; Data Visualization; Genome-wide Association Study; Gene Expression; Hands-on Experience; Large-scale Data Analysis; Multiple Hypothesis Testing; Open Resource; Reproducible Research; UK BIOBANK; SCIENCE;
DOI
10.1007/s12561-023-09375-9
Chinese Library Classification number
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open-source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain approximately 1.4 million p-values from a real scientific study within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open-source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on GitHub and Zenodo to further enhance reproducible research and training.
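The two statistical ideas named in the abstract, per-variant regression and large-scale multiple hypothesis testing, can be illustrated with a minimal sketch. This is not the workshop's code: the function names, the simulated data, the normal approximation to the t statistic, and the choice of a Bonferroni correction are all assumptions made for illustration only.

```python
import math
import random


def snp_association_pvalue(genotypes, expression):
    """Two-sided p-value for the slope of expression ~ genotype.

    Assumption: a normal approximation to the t statistic is used,
    which is only reasonable for large sample sizes.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        return 1.0  # monomorphic variant: no genotype variation to test
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    rss = sum((y - intercept - slope * g) ** 2
              for g, y in zip(genotypes, expression))
    if rss == 0:
        return 0.0  # perfect fit; standard error is zero
    std_err = math.sqrt(rss / (n - 2) / sxx)
    z = slope / std_err
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical toy data: 200 individuals, 50 variants coded 0/1/2,
# with expression simulated independently of every variant.
rng = random.Random(0)
n_people, n_snps = 200, 50
expression = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
p_values = []
for _ in range(n_snps):
    genotypes = [rng.choice([0, 1, 2]) for _ in range(n_people)]
    p_values.append(snp_association_pvalue(genotypes, expression))

threshold = bonferroni_threshold(0.05, n_snps)
n_hits = sum(p < threshold for p in p_values)
print(f"threshold={threshold:.2e}, variants below threshold={n_hits}")
```

With roughly 1.4 million tests, as in the abstract, the same Bonferroni rule at a 0.05 family-wise level would give a per-test threshold of about 3.6e-8, which shows why uncorrected p-values cannot be read at the usual 0.05 cutoff.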
Pages: 250-264
Page count: 15
Related papers
19 in total
  • [1] Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data
    Anton Sugolov
    Eric Emmenegger
    Andrew D. Paterson
    Lei Sun
    Statistics in Biosciences, 2024, 16 : 250 - 264
  • [2] Data validation and statistical issues such as power and other considerations in genome-wide association study (GWAS)
    Tomita, Makoto
    WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS, 2023, 15 (03)
  • [3] Genome-wide association study for ketosis in US Jerseys using producer-recorded data
    Gaddis, K. L. Parker
    Megonigal, J. H., Jr.
    Clay, J. S.
    Wolfe, C. W.
    JOURNAL OF DAIRY SCIENCE, 2018, 101 (01) : 413 - 424
  • [4] Integrating genome-wide association studies and gene expression data highlights dysregulated multiple sclerosis risk pathways
    Liu, Guiyou
    Zhang, Fang
    Jiang, Yongshuai
    Hu, Yang
    Gong, Zhongying
    Liu, Shoufeng
    Chen, Xiuju
    Jiang, Qinghua
    Hao, Junwei
    MULTIPLE SCLEROSIS JOURNAL, 2017, 23 (02) : 205 - 212
  • [5] Predicting allergic diseases in children using genome-wide association study (GWAS) data and family history
    Park, Jaehyun
    Jang, Haerin
    Kim, Mina
    Hong, Jung Yeon
    Kim, Yoon Hee
    Sohn, Myung Hyun
    Park, Sang-Cheol
    Won, Sungho
    Kim, Kyung Won
    WORLD ALLERGY ORGANIZATION JOURNAL, 2021, 14 (05):
  • [6] Integrating genome-wide association study and expression quantitative trait loci data identifies multiple genes and gene set associated with neuroticism
    Fan, Qianrui
    Wang, Wenyu
    Hao, Jingcan
    He, Awen
    Wen, Yan
    Guo, Xiong
    Wu, Cuiyan
    Ning, Yujie
    Wang, Xi
    Wang, Sen
    Zhang, Feng
    PROGRESS IN NEURO-PSYCHOPHARMACOLOGY & BIOLOGICAL PSYCHIATRY, 2017, 78 : 149 - 152
  • [7] Prediction of Stage, Grade, and Survival in Bladder Cancer Using Genome-wide Expression Data: A Validation Study
    Lauss, Martin
    Ringner, Markus
    Hoglund, Mattias
    CLINICAL CANCER RESEARCH, 2010, 16 (17) : 4421 - 4433
  • [8] Genome-wide Pathway Analysis Using Gene Expression Data of Colonic Mucosa in Patients with Inflammatory Bowel Disease
    Palmieri, Orazio
    Creanza, Teresa M.
    Bossa, Fabrizio
    Palumbo, Orazio
    Maglietta, Rosalia
    Ancona, Nicola
    Corritore, Giuseppe
    Latiano, Tiziana
    Martino, Giuseppina
    Biscaglia, Giuseppe
    Scimeca, Daniela
    De Petris, Michele P.
    Carella, Massimo
    Annese, Vito
    Andriulli, Angelo
    Latiano, Anna
    INFLAMMATORY BOWEL DISEASES, 2015, 21 (06) : 1260 - 1268
  • [9] Integrating a genome-wide association study with a large-scale transcriptome analysis to predict genetic regions influencing the glycaemic index and texture in rice
    Anacleto, Roslen
    Badoni, Saurabh
    Parween, Sabiha
    Butardo, Vito M., Jr.
    Misra, Gopal
    Paula Cuevas, Rosa
    Kuhlmann, Markus
    Trinidad, Trinidad P.
    Mallillin, Aida C.
    Acuin, Cecilia
    Bird, Anthony R.
    Morell, Matthew K.
    Sreenivasulu, Nese
    PLANT BIOTECHNOLOGY JOURNAL, 2019, 17 (07) : 1261 - 1275
  • [10] Powerful statistical method to detect disease-associated genes using publicly available genome-wide association studies summary data
    Zhang, Jianjun
    Zhao, Zihan
    Guo, Xuan
    Guo, Bin
    Wu, Baolin
    GENETIC EPIDEMIOLOGY, 2019, 43 (08) : 941 - 951