Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters

Cited by: 44
Authors
Bourel, Mathias [1 ,2 ]
Segura, Angel M. [2 ]
Crisci, Carolina [2 ]
Lopez, Guzman [2 ]
Sampognaro, Lia [2 ]
Vidal, Victoria [2 ]
Kruk, Carla [2 ,3 ,4 ]
Piccini, Claudia [2 ,3 ]
Perera, Gonzalo [2 ]
Affiliations
[1] Univ Republica, Fac Ingn, IMERL, Montevideo, Uruguay
[2] Univ Republ, Ctr Univ Reg Este, Dept Modelizac Estadist Datos & Inteligencia Arti, Rocha, Uruguay
[3] Minist Educ & Cultura, Inst Invest Biol Clemente Estable, Dept Microbiol, Montevideo, Uruguay
[4] Univ Republica, Fac Ciencias, Inst Ecol & Ciencias Ambientales, Montevideo, Uruguay
Keywords
Machine learning; Faecal coliform; Recreational waters; Prediction; AERUGINOSA COMPLEX MAC; INDICATOR BACTERIA; REGRESSION TREES; FRESH-WATER; CLASSIFICATION; QUALITY; MODEL; CALIFORNIA; RIVER; RISK;
DOI
10.1016/j.watres.2021.117450
Chinese Library Classification number
X [Environmental Science, Safety Science];
Discipline classification code
08 ; 0830 ;
Abstract
Predicting water contamination with statistical models is a useful tool for managing health risk at recreational beaches. Extreme contamination events, i.e. those exceeding regulatory limits, are generally rare relative to normal bathing conditions, and thus the data are said to be imbalanced. Modeling and predicting those rare events presents unique challenges. Here we introduce and evaluate several machine learning techniques and metrics to model imbalanced data and evaluate model performance. We do so by using a) simulated data sets and b) a real database with records of faecal coliform abundance monitored for 10 years at 21 recreational beaches in Uruguay (N ≈ 19000) using in situ and meteorological variables. We discuss advantages and disadvantages of the methods and provide a simple guide to fitting models for a general audience. We also provide R code to reproduce model fitting and testing. We found that most machine learning techniques are sensitive to imbalance and require specific data pre-treatment (e.g. upsampling) to improve performance. Accuracy (i.e. correctly classified cases over total cases) is not adequate to evaluate model performance on imbalanced data sets. Instead, true positive rates (TPR) and false positive rates (FPR) are recommended. Among the 52 candidate algorithms tested, the stratified Random Forest performed best, improving TPR by 50% with respect to the baseline (0.4) and outperforming the baseline on the evaluated metrics. Support vector machines combined with upsampling or the synthetic minority oversampling technique (SMOTE) performed well, similar to AdaBoost with SMOTE. These results suggest that combining modeling strategies is necessary to improve our capacity to anticipate water contamination and avoid health risk.
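The abstract's point about accuracy versus TPR and FPR, and about upsampling as a pre-treatment, can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names `confusion_rates` and `upsample_minority` are hypothetical, the toy labels (95 clean samples, 5 contamination events) are invented for illustration, and the upsampling shown is plain random duplication of minority cases rather than SMOTE.

```python
import random


def confusion_rates(y_true, y_pred):
    """Return accuracy, true positive rate and false positive rate for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
        "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
    }


def upsample_minority(rows, labels, minority=1, seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx:
        raise ValueError("no minority-class rows to upsample")
    deficit = max(len(majority_idx) - len(minority_idx), 0)
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Invented toy data: 95 clean samples (0) and 5 contamination events (1).
y_true = [0] * 95 + [1] * 5
rows = [[float(i)] for i in range(100)]  # placeholder feature vectors

# A classifier that always predicts "clean" looks accurate but detects nothing.
always_clean = [0] * 100
rates = confusion_rates(y_true, always_clean)

# Upsample the training data only; evaluation data should keep its real imbalance.
rows_up, labels_up = upsample_minority(rows, y_true)
```

On this toy data the always-clean classifier reaches an accuracy of 0.95 while its TPR is 0, which is why accuracy alone can hide a model that never flags a contamination event. After upsampling, the training labels contain 95 cases of each class.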
Pages: 11
Related papers
66 in total
  • [1] LIKELIHOOD OF A MODEL AND INFORMATION CRITERIA
    AKAIKE, H
    [J]. JOURNAL OF ECONOMETRICS, 1981, 16 (01) : 3 - 14
  • [2] Albers S., 2020, rsoi: Import various northern and southern hemisphere climate indices. R package version 0.5.4
  • [3] Evaluating statistical model performance in water quality prediction
    Avila, Rodelyn
    Horn, Beverley
    Moriarty, Elaine
    Hodson, Roger
    Moltchanova, Elena
    [J]. JOURNAL OF ENVIRONMENTAL MANAGEMENT, 2018, 206 : 910 - 919
  • [4] SmcHD1, containing a structural-maintenance-of-chromosomes hinge domain, has a critical role in X inactivation
    Blewitt, Marnie E.
    Gendrel, Anne-Valerie
    Pang, Zhenyi
    Sparrow, Duncan B.
    Whitelaw, Nadia
    Craig, Jeffrey M.
    Apedaile, Anwyn
    Hilton, Douglas J.
    Dunwoodie, Sally L.
    Brockdorff, Neil
    Kay, Graham F.
    Whitelaw, Emma
    [J]. NATURE GENETICS, 2008, 40 (05) : 663 - 669
  • [5] Multiclass classification methods in ecology
    Bourel, M.
    Segura, A. M.
    [J]. ECOLOGICAL INDICATORS, 2018, 85 : 1012 - 1021
  • [6] Consensus methods based on machine learning techniques for marine phytoplankton presence-absence prediction
    Bourel, M.
    Crisci, C.
    Martinez, A.
    [J]. ECOLOGICAL INFORMATICS, 2017, 42 : 46 - 54
  • [7] Random forests
    Breiman, L
    [J]. MACHINE LEARNING, 2001, 45 (01) : 5 - 32
  • [8] Predicting recreational water quality advisories: A comparison of statistical methods
    Brooks, Wesley
    Corsi, Steven
    Fienen, Michael
    Carvin, Rebecca
    [J]. ENVIRONMENTAL MODELLING & SOFTWARE, 2016, 76 : 81 - 94
  • [9] SMOTE: Synthetic minority over-sampling technique
    Chawla, Nitesh V.
    Bowyer, Kevin W.
    Hall, Lawrence O.
    Kegelmeyer, W. Philip
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2002, 16 : 321 - 357
  • [10] Chen C., 2004, Using Random Forest to Learn Imbalanced Data