Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water

Cited by: 21
Authors
Weller, Daniel Lowell [1, 2, 3]
Love, Tanzy M. T. [1]
Wiedmann, Martin [3]
Affiliations
[1] Univ Rochester, Dept Biostat & Computat Biol, Rochester, NY 14627 USA
[2] SUNY Environm Sci & Forestry, Dept Environm & Forest Biol, Syracuse, NY 13210 USA
[3] Cornell Univ, Dept Food Sci, Ithaca, NY 14853 USA
Funding
National Institute of Food and Agriculture (USA); National Institutes of Health (USA)
Keywords
Listeria; Listeria (L.) monocytogenes; machine learning; predictive modeling; agricultural water; food safety; class imbalance; SMOTE (synthetic minority over-sampling technique); LISTERIA-MONOCYTOGENES; INDICATOR BACTERIA; PRODUCE; CONTAMINATION; PREVALENCE; QUALITY; SALMONELLA; PERFORMANCE; VALIDATION; CHALLENGES;
DOI
10.3389/fenvs.2021.701288
Chinese Library Classification number
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
Recent studies have shown that predictive models can supplement or provide alternatives to E. coli testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. "Full models" were trained using all four feature types, while "nested models" used between one and three types. In total, 45 full (15 learners × 3 resampling approaches) and 108 nested (5 learners × 9 feature sets × 3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future, with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
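The two resampling methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the toy data, the function names `random_oversample` and `smote`, and the choice of k = 2 neighbours are all assumptions made for illustration. Random oversampling duplicates existing minority-class rows, whereas SMOTE creates synthetic rows by interpolating between a minority row and one of its nearest minority-class neighbours.

```python
import math
import random

def random_oversample(minority, n_new, rng):
    """Duplicate randomly chosen minority rows (sampling with replacement)."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic

rng = random.Random(0)

# Hypothetical toy data: two features per water sample
positives = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]      # minority class
negatives = [[5.0 + i * 0.1, 6.0 - i * 0.1] for i in range(12)]   # majority class

# Number of extra minority rows needed to balance the two classes
n_needed = len(negatives) - len(positives)

duplicated = random_oversample(positives, n_needed, rng)
interpolated = smote(positives, n_needed, k=2, rng=rng)

balanced_oversampled = positives + duplicated
balanced_smote = positives + interpolated
```

In practice, resampling of this kind would be applied to the training folds only, so that the held-out data used for evaluation keep the original class balance.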
Pages: 15