Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches

Times cited: 5
Authors
Alosaimi, Shatha [1]
van Biljon, Noelle [2]
Awany, Denis [3]
Thami, Prisca K. [4]
Defo, Joel [4]
Mugo, Jacquiline W. [5]
Bope, Christian D. [6]
Mazandu, Gaston K. [7]
Mulder, Nicola J. [8, 9]
Chimusa, Emile R. [7]
Affiliations
[1] Univ Cape Town, Div Human Genet, Human Genet, Cape Town, South Africa
[2] Univ Cape Town, Computat Biol Div, Bioinformat, Cape Town, South Africa
[3] Univ Cape Town, Human Genet, Cape Town, South Africa
[4] Univ Cape Town, Div Human Genet, Cape Town, South Africa
[5] Univ Cape Town, Computat Biol Div, Cape Town, South Africa
[6] Univ Kinshasa, Fac Sci, Dept Math & Comp Sci, Kinshasa, Dem Rep Congo
[7] Univ Cape Town, Dept Pathol, Div Human Genet, Cape Town, South Africa
[8] Univ Cape Town UCT, Computat Biol Div, Cape Town, South Africa
[9] PI H3ABioNet, Cape Town, South Africa
Funding
Wellcome Trust (UK); National Institutes of Health (USA);
Keywords
DNA sequence; next-generation sequence; simulation; variant calling; genomics; DISCOVERY; FRAMEWORK; ASSOCIATION; ALGORITHMS; PIPELINES; ALIGNMENT; MUTATION; TOOLS;
DOI
10.1093/bib/bbaa366
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Current variant calling (VC) approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, whereas most genetic diversity is found in non-European populations, such as those of Africa. When applied to these genetically diverse populations, VC tools may produce false positive and false negative results, which can lead to misleading conclusions in the prioritization of mutations and in assessing the clinical relevance and actionability of genes. The most prominent question is which tool or pipeline achieves high sensitivity and precision when analysing African data with either low or high sequence coverage, given the high genetic diversity and heterogeneity of these data. Here, a total of 100 synthetic Whole Genome Sequencing (WGS) samples, mimicking the genetic profiles of African and European subjects at different coverage levels (high/low), were generated to assess the performance of nine different VC tools on these contrasting datasets. The performance of these tools was assessed in terms of false positive and false negative call rates by comparing the simulated golden variants to the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best on African population data at both high and low coverage. Overall, current VC tools produce higher false positive and false negative rates when analysing African data than when analysing European data. This highlights the need for development of VC approaches with high sensitivity and precision tailored for populations characterized by high genetic variation and low linkage disequilibrium.
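The metrics named in the abstract (sensitivity, PPV and MCC) can be illustrated with a minimal sketch that compares a set of simulated "golden" variants against a set of called variants. This is not the authors' evaluation code. The variant representation (chromosome, position, ref, alt tuples), the function names, the toy data and the use of a fixed number of assessed sites `n_sites` to derive true negatives are all assumptions made for illustration.

```python
from math import sqrt


def confusion_counts(truth, called, n_sites):
    """Count TP, FP, FN, TN from two sets of variant keys.

    n_sites is the (assumed) number of assessed sites; every site that is
    neither a true variant nor a called variant counts as a true negative.
    """
    tp = len(truth & called)   # called and present in the golden set
    fp = len(called - truth)   # called but absent from the golden set
    fn = len(truth - called)   # in the golden set but missed by the caller
    tn = n_sites - tp - fp - fn
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of golden variants that were recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def ppv(tp, fp):
    """Positive predictive value (precision) of the call set."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (hypothetical): four golden variants, three recovered, one spurious call.
golden = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
          ("chr1", 300, "G", "A"), ("chr1", 400, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr1", 300, "G", "A"), ("chr1", 500, "A", "T")}

tp, fp, fn, tn = confusion_counts(golden, calls, n_sites=1000)
print(f"sensitivity={sensitivity(tp, fn):.3f} "
      f"PPV={ppv(tp, fp):.3f} MCC={mcc(tp, fp, fn, tn):.3f}")
```

Because true negatives vastly outnumber variants in whole-genome data, MCC in such a setting depends on how the set of assessed sites is defined; the choice of `n_sites` here is purely illustrative.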
Pages: 9