Development of a data-driven ensemble regressor and its applicability for identifying contextual and collective outliers in groundwater level time-series data

Cited by: 5
Authors
Kim, Yuhan [1]
Jeong, Jiho [1]
Park, Heejeong [1]
Kwon, Mijin [2]
Cho, Chunhyung [2]
Jeong, Jina [1]
Affiliations
[1] Kyungpook Natl Univ, Dept Geol, Daegu, South Korea
[2] Korea Radioact Waste Agcy KORAD, Daejeon, South Korea
Keywords
Long short-term memory; Groundwater level fluctuation; Ensemble estimation; Normal data range; Contextual and collective outlier identification; PRECIPITATION; ISLAND
DOI
10.1016/j.jhydrol.2022.128127
Chinese Library Classification number
TU [Building Science]
Discipline classification code
0813
Abstract
In this study, a method to estimate the normal range of groundwater level time-series data was developed in order to identify global, contextual, and collective outliers. To evaluate the normal range, the statistical characteristics of the groundwater level data and the patterns of the precipitation time-series data were incorporated into an LSTM (Long Short-Term Memory)-based ensemble regressor (i.e., the LER model). The LER model was used to generate multiple possible trends of the groundwater level, and general outlier identification rules (i.e., the sigma and Tukey's fences (TF) rules) were then applied to the LER ensemble estimates to define the range of the normal data. To validate the outlier identification performance, actual groundwater levels acquired from three groundwater monitoring stations in South Korea (i.e., the Pohang-Gibuk (PG), Namwon-Dotong (ND), and Jeju-Sangyae (JS) monitoring wells) and the corresponding precipitation data acquired from the nearest weather stations were used. Simple applications of the sigma and TF rules served as the reference methods for comparison. For the monitoring data, the developed LER-based outlier identification method evaluates the range of the data that can be explained by the modelled influences of interest (i.e., the normal data range). The developed method showed an outlier identification performance of >70% in general, while the performance of the sigma and TF rules was mostly <50%. In particular, because the method effectively estimated the seasonal trend and the variability of the groundwater level by considering the precipitation patterns and the statistics of the groundwater level variation, it is superior to the simple sigma and TF rules for identifying contextual and collective outliers.
The in-depth analysis indicates that the developed LER-based outlier identification method effectively discriminates abnormal data by considering both the intrinsic statistical characteristics of the original data trend and the exogenous factors. In terms of practical applicability, because the result can be acquired automatically from real-time monitoring data, the developed method is expected to support more efficient maintenance of monitoring devices when the model is embedded as management software in the monitoring network system.
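The range-definition step described in the abstract (applying the sigma and TF rules to an ensemble of estimates rather than to the raw series) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the ensemble here is synthetic noise around a seasonal curve instead of LSTM output, and the function names, the multipliers k = 3 and c = 1.5 (the conventional choices for these rules), the ensemble size, and the injected anomaly are all assumptions made for illustration.

```python
import numpy as np


def sigma_range(ensemble, k=3.0):
    """Per-time-step range: ensemble mean +/- k * ensemble standard deviation.

    `ensemble` has shape (n_members, n_steps). k = 3 is an assumed default.
    """
    mean = ensemble.mean(axis=0)
    std = ensemble.std(axis=0, ddof=1)
    return mean - k * std, mean + k * std


def tukey_range(ensemble, c=1.5):
    """Per-time-step Tukey's fences: [Q1 - c * IQR, Q3 + c * IQR].

    c = 1.5 is the conventional multiplier, assumed here.
    """
    q1, q3 = np.percentile(ensemble, [25, 75], axis=0)
    iqr = q3 - q1
    return q1 - c * iqr, q3 + c * iqr


def flag_outliers(observed, lower, upper):
    """Boolean mask of observations falling outside the normal range."""
    return (observed < lower) | (observed > upper)


# Synthetic stand-in for the ensemble estimates (hypothetical data, not LSTM output).
rng = np.random.default_rng(0)
t = np.arange(365)
seasonal = 10.0 + 0.5 * np.sin(2 * np.pi * t / 365)
ensemble = seasonal + rng.normal(0.0, 0.05, size=(50, t.size))

# Observed series with a 10-day shift that stays inside the global value range,
# i.e. a contextual/collective anomaly rather than a global one.
observed = seasonal.copy()
observed[200:210] += 0.4

lower_s, upper_s = sigma_range(ensemble)
lower_t, upper_t = tukey_range(ensemble)
flags_sigma = flag_outliers(observed, lower_s, upper_s)
flags_tukey = flag_outliers(observed, lower_t, upper_t)

# Reference method: the plain sigma rule applied to the whole observed series.
flags_global = np.abs(observed - observed.mean()) > 3.0 * observed.std(ddof=1)

print("ensemble sigma rule flags:", int(flags_sigma.sum()))
print("ensemble TF rule flags:   ", int(flags_tukey.sum()))
print("global sigma rule flags:  ", int(flags_global.sum()))
```

In this toy setting the time-varying range follows the seasonal curve, so the shifted segment falls outside it, whereas a single global threshold on the raw series does not separate that segment from the ordinary seasonal swing. The sketch only shows the mechanism; it says nothing about the performance figures reported in the abstract.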
Pages: 15