Predicting water quality variables using gradient boosting machine: global versus local explainability using SHapley Additive Explanations (SHAP)

Cited by: 4

Authors:

Merabet, Khaled [1]

Di Nunno, Fabio [2]

Granata, Francesco [2]

Kim, Sungwon [3]

Adnan, Rana Muhammad [4,7]

Heddam, Salim [1]

Kisi, Ozgur [5,8]

Zounemat-Kermani, Mohammad [6]

Affiliations:

[1] Univ 20 Aout 1955, Fac Sci, Agron Dept, Hydraul Div, Route El Hadaik,BP 26, Skikda, Algeria

[2] Univ Cassino & Southern Lazio, Dept Civil & Mech Engn DICEM, Via Biasio, 43, I-03043 Cassino, Frosinone, Italy

[3] Dongyang Univ, Dept Railroad Construct & Safety Engn, Yeongju 36040, South Korea

[4] Guangzhou Univ, Coll Architecture & Urban Planning, Guangzhou 510006, Peoples R China

[5] Ilia State Univ, Sch Technol, Dept Civil Engn, Tbilisi 0179, Georgia

[6] Shahid Bahonar Univ Kerman, Dept Civil Engn, Kerman, Iran

[7] Saveetha Inst Med & Tech Sci, Ctr Global Hlth Res, Chennai 600001, India

[8] Korea Univ, Sch Civil Environm & Architectural Engn, Seoul 02841, South Korea

Source:

EARTH SCIENCE INFORMATICS | 2025 / Volume 18 / Issue 03

Keywords:

Modelling; Water quality; Chl-a; DO; TU; AdaBoost; Boosting models; SHAP; SHORT-TERM-MEMORY; DISSOLVED-OXYGEN; LEARNING-MODEL; XGBOOST; RIVER; FRAMEWORK;

DOI:

10.1007/s12145-025-01796-y

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Water quality assessment is critical for ensuring the health of aquatic ecosystems and managing water resources effectively. However, accurately predicting key water quality variables remains challenging because of the complex interactions between environmental factors and anthropogenic influences. This study proposes a new modelling framework for predicting three water quality variables: (i) dissolved oxygen concentration (DO), (ii) water turbidity (TU), and (iii) chlorophyll a (Chl-a). Six machine learning models, namely adaptive boosting (AdaBoost), categorical boosting (CatBoost), histogram gradient boosting (HistGBRT), light gradient boosting machine (LightGBM), natural gradient boosting (NGBoost), and extreme gradient boosting (XGBoost), were applied and compared using combinations of a large number of water quality variables as inputs. All models were developed using data collected from three USGS stations in Illinois, USA: (i) USGS 05543010 Illinois River at Seneca, (ii) USGS 05586300 Illinois River at Florence, and (iii) USGS 05553700 Illinois River at Starved Rock. SHapley Additive Explanations (SHAP) was adopted for model interpretability and feature ranking, and all models were compared using several numerical indices and graphical representations. The results support the following conclusions. DO concentration was predicted with high accuracy; the CatBoost model performed best, with RMSE = 0.430, MAE = 0.326, R = 0.980, and NSE = 0.961. For Chl-a, all models were less accurate; the best performance was obtained with LightGBM, with RMSE = 5.916, MAE = 4.294, R = 0.892, and NSE = 0.795. For TU, none of the models was accurate, and performance was very poor. Finally, SHAP substantially improved understanding of the overall contribution of each water quality variable to the final prediction.
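The abstract reports four indices (RMSE, MAE, R, NSE) without defining them. The sketch below shows how such indices are commonly computed, assuming the standard textbook definitions: root mean square error, mean absolute error, Pearson correlation coefficient, and Nash-Sutcliffe efficiency. The abstract does not state the exact formulas the authors used, and the sample values are hypothetical and are not data from the paper.

```python
import math


def rmse(obs, pred):
    """Root mean square error (same unit as the variable)."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error (same unit as the variable)."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def pearson_r(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    mean_o = sum(obs) / len(obs)
    mean_p = sum(pred) / len(pred)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 means the model
    is no better than always predicting the observed mean."""
    mean_o = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - residual / spread


# Hypothetical observed and predicted values, for illustration only.
observed = [8.0, 9.0, 10.0, 11.0, 12.0]
predicted = [8.5, 9.0, 9.5, 11.0, 12.5]

print("RMSE:", rmse(observed, predicted))
print("MAE: ", mae(observed, predicted))
print("R:   ", pearson_r(observed, predicted))
print("NSE: ", nse(observed, predicted))
```

RMSE and MAE carry the unit of the predicted variable, so the DO and Chl-a values in the abstract are not directly comparable with each other. R and NSE are dimensionless, which makes them the more useful pair for comparing performance across the three variables.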


Pages: 34
