Viral taxonomy derived from evolutionary genome relationships

Cited by: 6
Authors
Dougan, Tyler J. [1, 5]
Quake, Stephen R. [2, 3, 4]
Affiliations
[1] Stanford Univ, Dept Phys, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Bioengn, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Appl Phys, Stanford, CA 94305 USA
[4] Chan Zuckerberg Biohub, Stanford, CA 94305 USA
[5] Harvard MIT Program Hlth Sci & Technol, Boston, MA USA
Keywords
VIRUS TAXONOMY; DATABASE; SEARCH; SEQUENCE; TREE;
DOI
10.1371/journal.pone.0220440
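Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code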
07 ; 0710 ; 09 ;
Abstract
We describe a new genome alignment-based model for understanding the diversity of viruses based on evolutionary genetic relationships. This approach uses information theory and a physical model to determine the information shared by the genes in two genomes. Pairwise comparisons of genes from the viruses are created from alignments using NCBI BLAST, and their match scores are combined to produce a metric between genomes, which is in turn used to determine a global classification using the 5,817 viruses on RefSeq. In cases where there is no measurable alignment between any genes, the method falls back to a coarser measure of genome relationship: the mutual information of 4-mer frequency. This results in a principled model which depends only on the genome sequence, which captures many interesting relationships between viral families, and which creates clusters which correlate well with both the Baltimore and ICTV classifications. The incremental computational cost of classifying a novel virus is low and therefore newly discovered viruses can be quickly identified and classified. The model goes beyond alignment-free classifications by producing a full phylogeny similar to those constructed by virologists using qualitative features, while relying only on objective genes. These results bolster the case for mathematical models in microbiology which can characterize organisms using only their genetic material and provide an independent check for phylogenies constructed by humans, considerably faster and more cheaply than less modern approaches.
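The fallback measure mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of "mutual information of 4-mer frequency", so the code below assumes one plausible reading: the mutual information between a genome label and a 4-mer drawn from an equal mixture of the two genomes, which is the Jensen-Shannon divergence of their 4-mer frequency profiles. The function names `kmer_frequencies` and `label_kmer_mutual_information` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import log2

K = 4
ALPHABET = "ACGT"
# All 256 possible 4-mers over the DNA alphabet.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def kmer_frequencies(seq, k=K):
    """Return normalized k-mer frequencies for one genome sequence.

    Windows containing symbols other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    valid = {kmer: counts.get(kmer, 0) for kmer in KMERS}
    total = sum(valid.values())
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return {kmer: c / total for kmer, c in valid.items()}


def label_kmer_mutual_information(p, q):
    """Mutual information in bits between genome label and k-mer.

    Assumption (not from the paper): each genome is sampled with
    probability 1/2, so the result equals the Jensen-Shannon
    divergence of p and q. It is 0 for identical profiles and 1 for
    profiles that share no k-mers.
    """
    mi = 0.0
    for kmer in KMERS:
        m = 0.5 * (p[kmer] + q[kmer])  # mixture frequency of this k-mer
        for f in (p[kmer], q[kmer]):
            if f > 0:
                mi += 0.5 * f * log2(f / m)
    return mi
```

Under this reading, smaller values indicate more similar 4-mer composition. How the paper turns such a quantity into a distance between genomes is not stated in the abstract.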
Pages: 17