Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control

Times cited: 1
Authors
Betschart, Raphael O. [1]
Riccio, Cristian [1]
Aguilera-Garcia, Domingo [2]
Blankenberg, Stefan [1, 3, 4]
Guo, Linlin [3]
Moch, Holger [2]
Seidl, Dagmar [2]
Solleder, Hugo [1]
Thalen, Felix [1]
Thiery, Alexandre [1]
Twerenbold, Raphael [3, 4, 5]
Zeller, Tanja [3, 4, 5]
Zoche, Martin [2]
Ziegler, Andreas [1, 3, 4, 6]
Affiliations
[1] Medizincampus Davos, Cardiocare, Medizincampus Davos, Davos, Switzerland
[2] Univ Hosp Zurich, Inst Pathol & Mol Pathol, Zurich, Switzerland
[3] Univ Med Ctr Hamburg Eppendorf, Univ Heart & Vasc Ctr Hamburg, Dept Cardiol, Hamburg, Germany
[4] Univ Med Ctr Hamburg Eppendorf, Univ Heart & Vasc Ctr Hamburg, Ctr Populat Hlth Innovat POINT, Hamburg, Germany
[5] German Ctr Cardiovasc Res DZHK, partner site Hamburg Kiel Lubeck, Hamburg, Germany
[6] Univ KwaZulu Natal, Sch Math Stat & Comp Sci, Pietermaritzburg, South Africa
Funding
European Union Horizon 2020
Keywords
DNA sequencing; DRAGEN; high-throughput sequencing; Illumina NovaSeq 6000; next-generation sequencing; STATISTICAL-ANALYSIS; ASSOCIATION TESTS; BIOINFORMATICS; PERSPECTIVE; CHALLENGES; DIAGNOSIS; SELECTION; VARIANTS; ACCURATE;
DOI
10.1002/bimj.202300278
Chinese Library Classification number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before association analysis between phenotypes and genotypes can be performed, the raw sequence data must undergo preprocessing and quality control (QC). Because many biostatisticians have not yet worked with WGS data, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics that are applied to WGS data at four stages: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with data from the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35x using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time increased linearly with genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
Pages: 27