One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads

Cited by: 58
Authors
Valiente-Mullor, Carlos [1]
Beamud, Beatriz [1]
Ansari, Ivan [1]
Frances-Cuesta, Carlos [1]
Garcia-Gonzalez, Neris [1]
Mejia, Lorena [1, 2]
Ruiz-Hueso, Paula [1]
Gonzalez-Candelas, Fernando [1, 3]
Affiliations
[1] Univ Valencia, FISABIO, Infect & Publ Hlth, Inst Integrat Syst Biol I2SysBio,Joint Res Unit, Valencia, Spain
[2] Univ San Francisco Quito, Colegio Ciencias Biol & Ambient, Inst Microbiol, Quito, Ecuador
[3] CIBER Epidemiol & Publ Hlth, Valencia, Spain
Keywords
PHYLOGENETIC ANALYSIS; ALIGNMENT; TRANSMISSION; EVOLUTION; VIRULENCE; OUTBREAK; CHOICE; TOOLS; MRSA
DOI
10.1371/journal.pcbi.1008678
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.
Author summary
Mapping consists of aligning reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is a common practice in genomic studies to use a single reference for mapping, usually the 'reference genome' of a species (a high-quality assembly). However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria.
It is known that genetic differences between the reference genome and the read sequences may produce incorrect alignments during mapping. Eventually, these errors could lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). To our knowledge, this is the first work to systematically examine the effect of different references for mapping on the inference of tree topology as well as the impact on recombination and natural selection inferences. Furthermore, the novelty of this work lies in a procedure that guarantees that we are evaluating only the effect of the reference. This effect has proved to be pervasive in the five bacterial species that we have studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, the use of different reference genomes may be advisable to assess the potential biases of mapping.
Pages: 29