The value of human data annotation for machine learning based anomaly detection in environmental systems

Cited by: 26
Authors
Russo, Stefania [1, 2]
Besmer, Michael D. [3]
Blumensaat, Frank [1, 5]
Bouffard, Damien [1]
Disch, Andy [1]
Hammes, Frederik [1]
Hess, Angelika [1, 5]
Lurig, Moritz [1, 8, 9]
Matthews, Blake [1, 8]
Minaudo, Camille [6]
Morgenroth, Eberhard [1, 5]
Tran-Khac, Viet [7]
Villez, Kris [1, 4]
Affiliations
[1] Eawag, Swiss Fed Inst Aquat Sci & Technol, CH-8600 Dubendorf, Switzerland
[2] Swiss Fed Inst Technol, Ecovis Lab Photogrammetry & Remote Sensing, Zurich, Switzerland
[3] onCyt Micro Biol AG, Zurich, Switzerland
[4] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
[5] Swiss Fed Inst Technol, Inst Environm Engn, CH-8093 Zurich, Switzerland
[6] Ecole Polytech Fed Lausanne, Phys Aquat Syst Lab, Margaretha Kamprad Chair, Lausanne, Switzerland
[7] Univ Savoie Mont Blanc, INRAE, CARRTEL, F-74200 Thonon Les Bains, France
[8] Eawag, Dept Fish Ecol & Evolut, Ctr Ecol Evolut & Biogeochem, 79 Seestr, CH-6047 Luzern, Switzerland
[9] Lund Univ, Dept Biol, S-22362 Lund, Sweden
Keywords
Machine learning; Anomaly detection; Environmental systems; Labels; Principal component analysis; Sequencing batch reactor; Fault detection; Water quality; Multivariate; Regression; Network
DOI
10.1016/j.watres.2021.117695
Chinese Library Classification number
X [Environmental science, safety science]
Discipline classification code
08; 0830
Abstract
Anomaly detection is the process of identifying unexpected data samples in datasets. Automated anomaly detection is either performed using supervised machine learning models, which require a labelled dataset for their calibration, or unsupervised models, which do not require labels. While academic research has produced a vast array of tools and machine learning models for automated anomaly detection, the research community focused on environmental systems still lacks a comparative analysis that is simultaneously comprehensive, objective, and systematic. This knowledge gap is addressed for the first time in this study, where 15 different supervised and unsupervised anomaly detection models are evaluated on 5 different environmental datasets from engineered and natural aquatic systems. To this end, anomaly detection performance, labelling efforts, as well as the impact of model and algorithm tuning are taken into account. As a result, our analysis reveals the relative strengths and weaknesses of the different approaches in an objective manner without bias for any particular paradigm in machine learning. Most importantly, our results show that expert-based data annotation is extremely valuable for anomaly detection based on machine learning.
Pages: 10