Applying Shannon's information theory to bacterial and phage genomes and metagenomes

Cited by: 20
Authors
Akhter, Sajia [1]
Bailey, Barbara A. [2]
Salamon, Peter [2]
Aziz, Ramy K. [3, 4, 5]
Edwards, Robert A. [1, 3, 6]
Affiliations
[1] San Diego State Univ, Computat Sci Res Ctr, San Diego, CA 92182 USA
[2] San Diego State Univ, Coll Sci, Dept Math & Stat, San Diego, CA 92182 USA
[3] San Diego State Univ, Coll Sci, Dept Comp Sci, San Diego, CA 92182 USA
[4] Cairo Univ, Fac Pharm, Dept Microbiol & Immunol, Cairo, Egypt
[5] Univ Calif San Diego, Syst Biol Res Grp, La Jolla, CA 92093 USA
[6] Argonne Natl Lab, Div Math & Comp Sci, Argonne, IL 60439 USA
Funding
U.S. National Science Foundation
Keywords
MATHEMATICAL-THEORY; RAST SERVER; ANNOTATION; CHROMOSOME; SYMMETRY;
DOI
10.1038/srep01033
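Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification codes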
07 ; 0710 ; 09 ;
Abstract
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Such measurement is valuable in evaluating large data sets, such as metagenomic libraries, to prioritize their analysis and annotation, thus saving computational resources. Here, Shannon's index of complete phage and bacterial genomes was examined. The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be used for rapidly screening for sequences with matches in available databases, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis.
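The abstract does not state the exact formula used in the paper, so the following is only a minimal sketch. It assumes the standard Shannon entropy, H = sum over words of p * log2(1/p), computed over overlapping words of a fixed size. The function name `kmer_shannon_entropy` and its parameters are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def kmer_shannon_entropy(sequence: str, word_size: int = 3) -> float:
    """Shannon entropy (in bits) of the overlapping words in a sequence.

    Assumption: standard Shannon entropy over observed word frequencies;
    the paper's own normalization, if any, is not reproduced here.
    """
    seq = sequence.upper()
    if word_size < 1 or len(seq) < word_size:
        raise ValueError("sequence must be at least word_size characters long")

    # Count every overlapping word of length word_size.
    counts = Counter(
        seq[i:i + word_size] for i in range(len(seq) - word_size + 1)
    )
    total = sum(counts.values())

    # H = sum_i p_i * log2(1 / p_i), with p_i the relative word frequency.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # A homopolymer has a single word, so its entropy is zero.
    print(kmer_shannon_entropy("AAAAAAAA", word_size=2))
    # A more varied sequence yields a higher value for the same word size.
    print(kmer_shannon_entropy("ACGTACGT", word_size=2))
```

In this sketch the result depends on sequence length, base composition, and word size, which is consistent with the dependencies the abstract reports, although the measured values in the paper may be defined or scaled differently.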
Pages: 7
Related papers
37 in total
[1]   PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies [J].
Akhter, Sajia ;
Aziz, Ramy K. ;
Edwards, Robert A. .
NUCLEIC ACIDS RESEARCH, 2012, 40 (16) :e126
[2]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[3]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[4]   The marine viromes of four oceanic regions [J].
Angly, Florent E. ;
Felts, Ben ;
Breitbart, Mya ;
Salamon, Peter ;
Edwards, Robert A. ;
Carlson, Craig ;
Chan, Amy M. ;
Haynes, Matthew ;
Kelley, Scott ;
Liu, Hong ;
Mahaffy, Joseph M. ;
Mueller, Jennifer E. ;
Nulton, Jim ;
Olson, Robert ;
Parsons, Rachel ;
Rayhawk, Steve ;
Suttle, Curtis A. ;
Rohwer, Forest .
PLOS BIOLOGY, 2006, 4 (11) :2121-2131
[5]  
[Anonymous], INFORM THEORY PRIMER
[6]   The RAST server: Rapid annotations using subsystems technology [J].
Aziz, Ramy K. ;
Bartels, Daniela ;
Best, Aaron A. ;
DeJongh, Matthew ;
Disz, Terrence ;
Edwards, Robert A. ;
Formsma, Kevin ;
Gerdes, Svetlana ;
Glass, Elizabeth M. ;
Kubal, Michael ;
Meyer, Folker ;
Olsen, Gary J. ;
Olson, Robert ;
Osterman, Andrei L. ;
Overbeek, Ross A. ;
McNeil, Leslie K. ;
Paarmann, Daniel ;
Paczian, Tobias ;
Parrello, Bruce ;
Pusch, Gordon D. ;
Reich, Claudia ;
Stevens, Rick ;
Vassieva, Olga ;
Vonstein, Veronika ;
Wilke, Andreas ;
Zagnitko, Olga .
BMC GENOMICS, 2008, 9 (1)
[7]   SEED Servers: High-Performance Access to the SEED Genomes, Annotations, and Metabolic Models [J].
Aziz, Ramy K. ;
Devoid, Scott ;
Disz, Terrence ;
Edwards, Robert A. ;
Henry, Christopher S. ;
Olsen, Gary J. ;
Olson, Robert ;
Overbeek, Ross ;
Parrello, Bruce ;
Pusch, Gordon D. ;
Stevens, Rick L. ;
Vonstein, Veronika ;
Xia, Fangfang .
PLOS ONE, 2012, 7 (10)
[8]   Subsystems-based servers for rapid annotation of genomes and metagenomes [J].
Aziz, Ramy Karam .
BMC BIOINFORMATICS, 2010, 11
[9]  
Benson DA, 2013, NUCLEIC ACIDS RES, V41, pD36, DOI [10.1093/nar/gkn723, 10.1093/nar/gkp1024, 10.1093/nar/gkw1070, 10.1093/nar/gkr1202, 10.1093/nar/gkx1094, 10.1093/nar/gkl986, 10.1093/nar/gkq1079, 10.1093/nar/gks1195, 10.1093/nar/gkg057]
[10]  
Chang CH, 2004, 2004 IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE, PROCEEDINGS, P20