Enhancing outlier detection in air quality index data using a stacked machine learning model

Cited by: 0
Authors
Diallo, Abdoul Aziz [1]
Nderu, Lawrence [2]
Malenje, Bonface Miya [3]
Kikuvi, Gideon Mutie [4]
Affiliations
[1] Pan African Univ, Inst Basic Sci Technol & Innovat PAUSTI, Dept Math Data Sci, Nairobi 6200000200, Kenya
[2] Jomo Kenyatta Univ Agr & Technol JKUAT, Sch Comp & Information Technol, Dept Comp, Nairobi, Kenya
[3] Jomo Kenyatta Univ Agr & Technol JKUAT, Dept Stat & Actuarial Sci, Nairobi, Kenya
[4] Jomo Kenyatta Univ Agr & Technol JKUAT, Dept Environm Hlth & Dis Control, Nairobi, Kenya
Keywords
air pollution; air quality index; data mining; environmental analysis; gradient boosting classifier; K-means; outlier detection; random forest
DOI
10.1002/eng2.12936
CLC number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The air quality index (AQI) is a widely used metric for evaluating air quality across locations and time periods. Like other environmental datasets, AQI data can contain outliers: data points that diverge markedly from the norm and signal exceptionally favorable or adverse air quality. Detecting them is crucial for identifying and understanding severe pollution episodes with far-reaching environmental and public health implications. This study uses daily air quality data collected in Shanghai, China, from January 1, 2014, to January 31, 2021, as the experimental dataset. The dataset includes daily AQI measurements along with the concentrations of six pollutants: particulate matter (PM2.5 and PM10), sulfur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3), and carbon monoxide (CO). Each pollutant's concentration is measured in micrograms per cubic meter (μg/m³). The dataset is cleaned and normalized, and K-means clustering is then applied to discover different patterns. A stacked ensemble machine learning model that incorporates K-means clustering, random forest (RF), and gradient boosting classifier (GBC) is developed and compared with decision tree, support vector machine, K-nearest neighbor, and Naive Bayes algorithms; its performance in identifying outliers is evaluated using accuracy, precision, recall, and F1-score. The stacked model outperformed all other established models, with accuracy, precision, recall, and F1-score of 0.99, 0.99, 0.97, and 0.99, respectively.
This study explores outlier detection in Shanghai's AQI data from January 2014 to January 2021 using a stacked ensemble machine learning model combining K-means clustering, random forest, and gradient boosting classifier. The model's performance, which surpasses that of traditional methods such as decision trees and SVMs, is evaluated through metrics such as accuracy and F1-score, demonstrating its effectiveness in identifying significant pollution episodes with implications for environmental and public health.
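The abstract names three steps that can be illustrated on a small scale: normalization, K-means clustering, and scoring with accuracy, precision, recall, and F1-score. The following is a minimal sketch only, not the authors' pipeline. It does not reproduce the RF/GBC stacked ensemble. All function names are hypothetical, the sample rows are invented toy values, and the rule of treating small clusters as outliers (with a 0.2 size threshold) is an assumption made for illustration.

```python
import math


def min_max_normalize(rows):
    """Scale each column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def farthest_first_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans_labels(points, k, max_iter=100):
    """Lloyd's K-means; returns the cluster index of each point."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def flag_small_clusters(labels, max_fraction=0.2):
    """Assumed rule: points in clusters holding less than max_fraction
    of the data are flagged as outliers (1), all others as normal (0)."""
    n = len(labels)
    sizes = {lab: labels.count(lab) for lab in set(labels)}
    return [1 if sizes[lab] / n < max_fraction else 0 for lab in labels]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = outlier)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Invented toy rows of (AQI, PM2.5); the last row is an extreme day.
rows = [[50, 30], [55, 35], [60, 38], [52, 31],
        [150, 110], [155, 115], [148, 108], [400, 350]]
flags = flag_small_clusters(kmeans_labels(min_max_normalize(rows), k=3))
metrics = classification_metrics([0, 0, 0, 0, 0, 0, 0, 1], flags)
```

On this toy input the extreme day ends up alone in its own cluster and is the only row flagged. In a setup closer to the one the abstract describes, cluster-derived labels such as `flags` would presumably serve as inputs or targets for the RF and GBC learners, but the abstract does not specify how the three components are combined.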
Pages: 18