Interpretation of ensemble learning to predict water quality using explainable artificial intelligence

Cited by: 104
Authors
Park, Jungsu [1]
Lee, Woo Hyoung [2]
Kim, Keug Tae [3]
Park, Cheol Young [4]
Lee, Sanghun [5]
Heo, Tae-Young [5]
Affiliations
[1] Hanbat Natl Univ, Dept Civil & Environm Engn, 125 Dongseo Daero, Daejeon 34158, South Korea
[2] Univ Cent Florida, Dept Civil Environm & Construct Engn, 12800 Pegasus Dr, Orlando, FL 32816 USA
[3] Univ Suwon, Dept Environm & Energy Engn, 17 Wauan Gil, Hwaseong Si 18323, Gyeonggi Do, South Korea
[4] BAIES, Bayesian AI Lab, Fairfax, VA 22030 USA
[5] Chungbuk Natl Univ, Dept Informat & Stat, Chungdae Ro 1, Cheongju 28644, Chungbuk, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Algal management; Ensemble model; Machine learning; Water quality; XGBoost; NEURAL-NETWORKS; BLACK-BOX; BIOMASS; MODEL; MICROCYSTIS; INFORMATION; LAKE;
DOI
10.1016/j.scitotenv.2022.155070
Chinese Library Classification number
X [Environmental Science, Safety Science];
Discipline classification code
08; 0830;
Abstract
Algal blooms are a significant issue in freshwater quality management; in particular, predicting algal concentration is essential to maintaining the safety of the drinking water supply system. The chlorophyll-a (Chl-a) concentration is a commonly used indicator for estimating algal concentration. In this study, an XGBoost (XGB) ensemble machine learning (ML) model was developed from eighteen input variables to predict Chl-a concentration. The composition and pretreatment of the input variables are important factors for improving model performance. Explainable artificial intelligence (XAI) is an emerging area of ML modeling that provides a reasonable interpretation of model performance. The effect of input variable selection on model performance was estimated, with the priority of input variable selection determined using three indices: Shapley value (SHAP), feature importance (FI), and variance inflation factor (VIF). SHAP analysis is an XAI algorithm designed to compute the relative importance of input variables with consistency, providing an interpretable analysis of model predictions. The XGB models simulated with independent variables selected using the three indices were evaluated with root mean square error (RMSE), RMSE-observation standard deviation ratio (RSR), and Nash-Sutcliffe efficiency (NSE). The study shows that the model exhibited the most stable performance when the priority of input variables was determined by SHAP. This implies that on-site monitoring can be designed to collect the input variables selected by the SHAP analysis, reducing the cost of overall water quality analysis. The independent variables were further analyzed using the SHAP summary plot, force plot, target plot, and partial dependency plot to provide an understandable interpretation of the performance of the XGB model. While XAI is still in the early stages of development, this study demonstrates an application of XAI that improves the interpretation of machine learning model performance in predicting water quality.
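The three evaluation metrics named in the abstract can be sketched as follows. The abstract does not give the formulas, so this sketch assumes the commonly used definitions: RMSE as the root of the mean squared error, the RMSE-observation standard deviation ratio as RMSE divided by the standard deviation of the observations, and Nash-Sutcliffe efficiency as one minus the ratio of the squared-error sum to the squared deviation of the observations from their mean. The function names and the sample values are illustrative only and are not taken from the study.

```python
import math


def _sums(obs, pred):
    """Return (sum of squared errors, sum of squared deviations of obs from its mean)."""
    if len(obs) != len(pred) or not obs:
        raise ValueError("obs and pred must be non-empty and of equal length")
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return sse, sst


def rmse(obs, pred):
    """Root mean square error."""
    sse, _ = _sums(obs, pred)
    return math.sqrt(sse / len(obs))


def rsr(obs, pred):
    """RMSE-observation standard deviation ratio (assumed form: sqrt(SSE) / sqrt(SST))."""
    sse, sst = _sums(obs, pred)
    return math.sqrt(sse) / math.sqrt(sst)


def nse(obs, pred):
    """Nash-Sutcliffe efficiency (assumed form: 1 - SSE / SST)."""
    sse, sst = _sums(obs, pred)
    return 1.0 - sse / sst


# Hypothetical Chl-a values, made up for illustration (not data from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(observed, predicted), rsr(observed, predicted), nse(observed, predicted))
```

Under these definitions NSE equals 1 minus RSR squared, so the two metrics rank models identically; RMSE differs in that it keeps the units of the target variable.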
Pages: 12
Related papers
49 in total
  • [11] Stochastic gradient boosting
    Friedman, JH
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2002, 38 (04) : 367 - 378
  • [12] Garson G.D., 1991, AI expert, V6, P46
  • [13] LSTM: A Search Space Odyssey
    Greff, Klaus
    Srivastava, Rupesh K.
    Koutnik, Jan
    Steunebrink, Bas R.
    Schmidhuber, Juergen
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2017, 28 (10) : 2222 - 2232
  • [14] XAI-Explainable artificial intelligence
    Gunning, David
    Stefik, Mark
    Choi, Jaesik
    Miller, Timothy
    Stumpf, Simone
    Yang, Guang-Zhong
    [J]. SCIENCE ROBOTICS, 2019, 4 (37)
  • [15] A fast learning algorithm for deep belief nets
    Hinton, Geoffrey E.
    Osindero, Simon
    Teh, Yee-Whye
    [J]. NEURAL COMPUTATION, 2006, 18 (07) : 1527 - 1554
  • [16] Modeling lake trophic state: a random forest approach
    Hollister, Jeffrey W.
    Milstead, W. Bryan
    Kreakie, Betty J.
    [J]. ECOSPHERE, 2016, 7 (03):
  • [17] Combination of artificial neural network and clustering techniques for predicting phytoplankton biomass of Lake Poyang, China
    Huang, Jiacong
    Gao, Junfeng
    Zhang, Yinjun
    [J]. LIMNOLOGY, 2015, 16 (03) : 179 - 191
  • [18] Thermal effects on the growth and fatty acid composition of four harmful algal bloom species: Possible implications for ichthyotoxicity
    Hyun, Bonggil
    Ju, Se-Jong
    Ko, Ah-Ra
    Choi, Keun-Hyung
    Jung, Seung Won
    Jang, Pung-Guk
    Jang, Min-Chul
    Moon, Chang Ho
    Shin, Kyoungsoon
    [J]. OCEAN SCIENCE JOURNAL, 2016, 51 (03) : 333 - 342
  • [19] Ke GL, 2017, ADV NEUR IN, V30
  • [20] Application of Artificial Neural Networks to Rainfall Forecasting in the Geum River Basin, Korea
    Lee, Jeongwoo
    Kim, Chul-Gyum
    Lee, Jeong Eun
    Kim, Nam Won
    Kim, Hyeonjun
    [J]. WATER, 2018, 10 (10)