Predicting variable gene content in Escherichia coli using conserved genes

Cited by: 1
Authors
Nguyen, Marcus [1, 2]
Elmore, Zachary [3]
Ihle, Clay [3]
Moen, Francesco S. [3]
Slater, Adam D. [3]
Turner, Benjamin N. [3]
Parrello, Bruce [2, 4]
Best, Aaron A. [3]
Davis, James J. [1, 2]
Affiliations
[1] Argonne Natl Lab, Data Sci & Learning Div, Lemont, IL 60439 USA
[2] Univ Chicago, Consortium Adv Sci & Engn, Chicago, IL 60637 USA
[3] Hope Coll, Biol Dept, Holland, MI USA
[4] Fellowship Interpretat Genomes, Burr Ridge, IL USA
Funding
National Science Foundation (USA);
Keywords
machine learning; horizontal gene transfer; antimicrobial resistance; bacterial virulence; phylogeny; FUNCTIONAL PROFILES; QUALITY;
DOI
10.1128/msystems.00058-23
Chinese Library Classification number
Q93 [Microbiology];
Discipline classification codes
071005; 100705;
Abstract
Having the ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in Escherichia coli genomes using only the nucleotide k-mers from a set of 100 conserved genes as features. Protein families were used to define orthologs, and a single classifier was built for predicting the presence or absence of each protein family occurring in 10%-90% of all E. coli genomes. The resulting set of 3,259 extreme gradient boosting classifiers had a per-genome average macro F1 score of 0.944 [0.943-0.945, 95% CI]. We show that the F1 scores are stable across multi-locus sequence types and that the trend can be recapitulated by sampling a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including "hypothetical proteins," was accurately predicted (F1 = 0.902 [0.898-0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions had slightly lower F1 scores but were still accurate (F1s = 0.895, 0.872, 0.824, and 0.841 for transposon, phage, plasmid, and antimicrobial resistance-related functions, respectively). Finally, using a holdout set of 419 diverse E. coli genomes that were isolated from freshwater environmental sources, we observed an average per-genome F1 score of 0.880 [0.876-0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.
IMPORTANCE: Having the ability to predict the protein-encoding gene content of a genome is important for assessing genome quality, binning genomes from shotgun metagenomic assemblies, and assessing risk due to the presence of antimicrobial resistance and other virulence genes. In this study, we built a set of binary classifiers for predicting the presence or absence of variable genes occurring in 10%-90% of all publicly available E. coli genomes. Overall, the results show that a large portion of the E. coli variable gene content can be predicted with high accuracy, including genes with functions relating to horizontal gene transfer. This study offers a strategy for predicting gene content using limited input sequence data.
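As a rough illustration of two ideas named in the abstract (nucleotide k-mer features drawn from conserved genes, and a macro F1 score over presence/absence calls), the following minimal Python sketch may help. It is not the authors' pipeline: the function names, the choice of k, the handling of ambiguous bases, and the reading of "macro F1" as the unweighted mean of the F1 scores for the "absent" and "present" classes are all assumptions made for illustration, and no classifier is trained here.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping nucleotide k-mers, skipping any that contain non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(genes: list[str], k: int) -> list[int]:
    """Sum k-mer counts over a genome's conserved genes into a fixed-order vector."""
    total = Counter()
    for gene in genes:
        total.update(kmer_counts(gene, k))
    # Fixed vocabulary of all 4**k k-mers so every genome gets the same columns.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [total[kmer] for kmer in vocab]


def f1_for_class(y_true, y_pred, positive) -> float:
    """F1 score treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred) -> float:
    """Assumed definition: unweighted mean of F1 for absent (0) and present (1)."""
    return (f1_for_class(y_true, y_pred, 0) + f1_for_class(y_true, y_pred, 1)) / 2
```

In this sketch, each genome would contribute one feature vector, and each variable protein family would get its own binary label (1 if present, 0 if absent); a per-genome score would then compare one genome's true and predicted labels across all protein families.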
Pages: 15