Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences

Cited by: 59
Authors
Blackwell, Grace [1, 2]
Hunt, Martin R. [1, 3]
Malone, Kerri [1]
Lima, Leandro [1]
Horesh, Gal [2, 5]
Alako, Blaise T. F. [1]
Thomson, Nicholas [2, 4]
Iqbal, Zamin [1]
Affiliations
[1] EMBL EBI, Wellcome Genome Campus, Hinxton, England
[2] Wellcome Sanger Inst, Wellcome Genome Campus, Hinxton, England
[3] Univ Oxford, Nuffield Dept Med, Oxford, England
[4] London Sch Hyg & Trop Med, London, England
[5] Chesterford Res Pk, Cambridge, England
Funding
Bill & Melinda Gates Foundation; Wellcome Trust (UK);
Keywords
PROKARYOTIC GENOME ANNOTATION; RESISTANCE; MECHANISMS; RESOURCE; DISEASE; QUALITY;
DOI
10.1371/journal.pbio.3001421
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010; 081704;
Abstract
The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
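The abstract mentions MinHash indices for estimating genomic distance. As a rough illustration of the general idea only, the sketch below builds bottom-s MinHash sketches from k-mers and estimates the Jaccard similarity between two sequences. It is not the pipeline used for this resource: the function names, the hash function (BLAKE2b), and the values of `k` and `size` are all assumptions chosen for readability, and real tools additionally handle canonical (reverse-complement) k-mers and use far larger parameters.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(kmer):
    """Hash a k-mer to a 64-bit integer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, size=16):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes."""
    return sorted({stable_hash(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sk_a, sk_b, size=16):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    the fraction of them that occur in both inputs.
    """
    merged = sorted(set(sk_a) | set(sk_b))[:size]
    if not merged:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA"
    b = "ACGTACGTTGCAACGTAGCTAGCTAGGATGCA"  # one substitution relative to a
    print(jaccard_estimate(sketch(a), sketch(b)))
```

Sketches are small and fixed in size, which is why this style of index makes it practical to compare one genome against hundreds of thousands of others without re-reading the full assemblies.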
Pages: 16