Improving hazard characterization in microbial risk assessment using next generation sequencing data and machine learning: Predicting clinical outcomes in shigatoxigenic Escherichia coli

Cited by: 39
Authors
Njage, Patrick Murigu Kamau [1]
Leekitcharoenphon, Pimlapas [1]
Hald, Tine [1]
Affiliations
[1] Tech Univ Denmark, Natl Food Inst, Res Grp Genom Epidemiol, Bldg 204, DK-2800 Lyngby, Denmark
Keywords
Hazard characterization; Hazard identification; Risk characterization; Infection outcome; STEC; Whole genome sequencing; Logit boost; NON-O157 SHIGA TOXIN; SECRETION SYSTEMS; O157; VIRULENCE; IDENTIFICATION; INFECTIONS; REGRESSION; CLASSIFIER; EXPRESSION; PROTEIN;
DOI
10.1016/j.ijfoodmicro.2018.11.016
Chinese Library Classification number
TS2 [Food industry]
Discipline classification code
0832
Abstract
The ever-decreasing cost and increasing throughput of next generation sequencing (NGS) techniques have resulted in a rapid increase in the availability of NGS data. Such data have the potential for rapid, reproducible and highly discriminative characterization of pathogens. This provides an opportunity in microbial risk assessment to account for variations in survivability and virulence among strains. A major challenge for such attempts remains the high dimensionality of genomic data relative to the number of isolates. Machine learning-based (ML) predictive risk modelling provides a solution to this "curse of dimensionality" while accounting for individual effects that depend on interactions with other genetic and environmental factors. This pilot study explores the potential of ML in the prediction of health endpoints resulting from shigatoxigenic E. coli (STEC) infection. Amino acid sequences of accessory genes were used as model input to predict and differentiate health outcomes in STEC infections, including diarrhea, bloody diarrhea, hemolytic uremic syndrome (HUS) and their combinations. Outcome severity was also distinguished by hospitalization. A matrix of percent similarity between accessory genes and the E. coli genomes was generated and subsequently used as input for ML. The performances of the ML algorithms random forest, support vector machine (radial and linear kernel), gradient boosting, and logit boost were compared. Logit boost was the best model, showing an outcome prediction accuracy of 0.75 (95% CI: 0.60, 0.86), an excellent or substantial performance (Kappa = 0.72).
Important genetic predictors of riskier STEC clinical outcomes included proteins involved in initial attachment to the host cell, persistence of plasmids or genomic islands, conjugative plasmid transfer and formation of sex pili, regulation of locus of enterocyte effacement expression, post-translational acetylation of proteins, facilitation of the rearrangement or deletion of sections within the pathogenic islands, and transport of macromolecules across the cell envelope. We propose further studies on the proteins with undefined or unclear functionality. One protein family in particular predicted HUS outcome. Toxin-antitoxin systems are potential stress adaptation markers which may mediate environmental persistence of strains in diverse sources. We foresee the application of the ML approach to the set-up of real-time online analysis of whole genome sequence data to estimate the human health risk at the population or strain level. The ML approach is envisaged to support the prediction of more specific STEC clinical endpoint types by inputting isolate sequence data.
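The abstract reports model performance as an accuracy with a 95% confidence interval and a Kappa statistic. The sketch below shows how such figures could be derived from a confusion matrix. It is a minimal illustration, not the authors' pipeline: the function names and the toy confusion matrix are hypothetical, and the Wilson score interval is an assumed choice, since the abstract does not state which interval method was used.

```python
import math


def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of isolates whose true outcome class
    is i and whose predicted outcome class is j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of isolates on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    # Undefined when expected == 1 (all mass in one class); not handled here.
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (assumed method)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical 3-class confusion matrix (made-up counts, not study data).
toy = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
acc, kappa = accuracy_and_kappa(toy)
correct = sum(toy[i][i] for i in range(len(toy)))
total = sum(sum(row) for row in toy)
low, high = wilson_interval(correct, total)
print(f"accuracy={acc:.2f}, kappa={kappa:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Kappa corrects the raw accuracy for agreement expected by chance, which is why it is reported alongside accuracy when outcome classes are imbalanced.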
Pages: 72-82
Number of pages: 11