Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters

Cited by: 44

Authors:

Bourel, Mathias ^{[1,2]}

Segura, Angel M. ^{[2]}

Crisci, Carolina ^{[2]}

Lopez, Guzman ^{[2]}

Sampognaro, Lia ^{[2]}

Vidal, Victoria ^{[2]}

Kruk, Carla ^{[2,3,4]}

Piccini, Claudia ^{[2,3]}

Perera, Gonzalo ^{[2]}

Affiliations:

[1] Univ Republica, Fac Ingn, IMERL, Montevideo, Uruguay

[2] Univ Republ, Ctr Univ Reg Este, Dept Modelizac Estadist Datos & Inteligencia Arti, Rocha, Uruguay

[3] Minist Educ & Cultura, Inst Invest Biol Clemente Estable, Dept Microbiol, Montevideo, Uruguay

[4] Univ Republica, Fac Ciencias, Inst Ecol & Ciencias Ambientales, Montevideo, Uruguay

Source:

WATER RESEARCH | 2021 / Vol. 202

Keywords:

Machine learning; Faecal coliform; Recreational waters; Prediction; AERUGINOSA COMPLEX MAC; INDICATOR BACTERIA; REGRESSION TREES; FRESH-WATER; CLASSIFICATION; QUALITY; MODEL; CALIFORNIA; RIVER; RISK;

DOI:

10.1016/j.watres.2021.117450

Chinese Library Classification (CLC) number:

X [Environmental Science, Safety Science];

Discipline classification code:

08 ; 0830 ;

Abstract:

Predicting water contamination with statistical models is a useful tool for managing health risk at recreational beaches. Extreme contamination events, i.e. those exceeding normative limits, are generally rare with respect to bathing conditions, and thus the data are said to be imbalanced. Modeling and predicting those rare events presents unique challenges. Here we introduce and evaluate several machine learning techniques and metrics to model imbalanced data and evaluate model performance. We do so using a) simulated data sets and b) a real database with records of faecal coliform abundance monitored for 10 years at 21 recreational beaches in Uruguay (N ≈ 19000), using in situ and meteorological variables. We discuss advantages and disadvantages of the methods and provide a simple guide to fitting models for a general audience. We also provide R code to reproduce model fitting and testing. We found that most machine learning techniques are sensitive to imbalance and require specific data pre-treatment (e.g. upsampling) to improve performance. Accuracy (i.e. correctly classified cases over total cases) is not adequate for evaluating model performance on imbalanced data sets. Instead, true positive rates (TPR) and false positive rates (FPR) are recommended. Among the 52 candidate algorithms tested, the stratified Random Forest showed the best performance, improving TPR by 50% with respect to the baseline (0.4), and outperformed the baseline in the evaluated metrics. Support vector machines combined with upsampling or the synthetic minority oversampling technique (SMOTE) performed well, similar to AdaBoost with SMOTE. These results suggest that combining modeling strategies is necessary to improve our capacity to anticipate water contamination and avoid health risk.
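The abstract's point about accuracy versus TPR and FPR can be illustrated with a minimal Python sketch. This is not the authors' R code: the toy labels (95 clean samples, 5 contaminated), the helper names `rates` and `upsample_minority`, and the simple random upsampling with replacement are all assumptions made for illustration only.

```python
import random


def rates(y_true, y_pred):
    """Return (accuracy, TPR, FPR) for binary labels, where 1 = contaminated."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of real events detected
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of clean samples flagged
    return accuracy, tpr, fpr


def upsample_minority(rows, labels, seed=0):
    """Randomly resample the minority class (with replacement) until both
    classes have the same number of rows. Hypothetical helper, not SMOTE."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to resample from; return unchanged copies.
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy imbalanced data: 95 clean samples (0) and 5 contamination events (1).
y_true = [0] * 95 + [1] * 5

# A trivial "model" that never predicts contamination.
y_always_clean = [0] * 100
acc, tpr, fpr = rates(y_true, y_always_clean)

# Balance the classes before training (the features here are placeholders).
rows = list(range(100))
rows_bal, labels_bal = upsample_minority(rows, y_true)
```

In this toy case the trivial classifier reaches an accuracy of 0.95 while detecting none of the contamination events (TPR of 0), which is why accuracy alone is misleading on imbalanced data. After upsampling, both classes contain 95 rows each.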

Pages: 11

