Feature extraction and prediction of fine particulate matter (PM2.5) chemical constituents using four machine learning models

Cited by: 19
Authors
Lee, Young Su [1]
Choi, Eunhwa [2]
Park, Minjae [1]
Jo, Hyeri [1]
Park, Manho [3]
Nam, Eunjung [4]
Kim, Dai Gon [4]
Yi, Seung-Muk [5]
Kim, Jae Young [1]
Affiliations
[1] Seoul Natl Univ, Dept Civil & Environm Engn, 1 Gwanak Ro, Seoul, South Korea
[2] Res Inst Ind Sci & Technol, Cheongam Ro, Pohang Si, Gyeongsangbuk D, South Korea
[3] Univ Illinois, Dept Civil & Environm Engn, 205 N Mathews Ave, Urbana, IL USA
[4] Natl Inst Environm Res Korea, Dept Air Qual Res, Incheon 22689, South Korea
[5] Seoul Natl Univ, Grad Sch Publ Hlth, Dept Environm Hlth Sci, 1 Gwanak Ro, Seoul, South Korea
Funding
National Research Foundation of Singapore
Keywords
PM2.5 chemical constituents; Machine learning; Generative adversarial imputation networks; Fully connected deep neural network; Random forest; K-nearest neighbor
DOI
10.1016/j.eswa.2023.119696
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The concentrations of fine particulate matter (PM2.5) constituents, which are essential information for identifying air pollution sources, were predicted at three sites (Seoul, Ulsan, Baengnyeong) in South Korea between 2016 and 2018 using four machine learning (ML) models: generative adversarial imputation network (GAIN), fully connected deep neural network (FCDNN), random forest (RF), and k-nearest neighbor (kNN). Three PM2.5 constituent groups, namely 8 ions, 2 carbons, and 15 trace elements, were targeted for prediction. The latest hyperparameter optimization techniques were used to learn air pollution characteristics from ambient PM2.5-related information, such as time, meteorology, and air pollutant concentrations. We compared the feature extraction abilities of the four models. The prediction accuracy, measured by the coefficient of determination (R²) between prediction and observation, was highest in GAIN, followed by FCDNN and then RF or kNN. Given data on time, air pollutant concentrations, and/or meteorology, the 20 % of data missing simultaneously from all PM2.5 constituent groups was predicted with R² = 0.897, 0.861, 0.785, and 0.801 by GAIN, FCDNN, RF, and kNN, respectively. As the missing ratio of the input data increased (20 %, 40 %, 60 %, 80 %), prediction accuracy decreased in all four models, and the decrease was most noticeable in GAIN and kNN. As the available period of data increased, prediction accuracy increased in GAIN and FCDNN. Among the target constituent groups, trace elements were predicted with the lowest R² in all models. Study sites with more emission sources showed lower prediction accuracy, resulting in the highest R² at Baengnyeong island and the lowest in Ulsan. According to the current findings, ML models can be used to evaluate various air pollution issues for which data are missing.
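The evaluation idea described in the abstract (hide a fraction of the target values, predict them from the remaining information, and score the result with R² between prediction and observation) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic stand-ins, the kNN here is a plain unweighted average over Euclidean neighbors, and the names `r_squared`, `mask_rows`, `knn_predict`, and `evaluate` are hypothetical helpers introduced only for illustration.

```python
import random


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mask_rows(n_rows, missing_ratio, rng):
    """Choose the row indices whose target value is treated as missing."""
    n_missing = int(round(n_rows * missing_ratio))
    return set(rng.sample(range(n_rows), n_missing))


def knn_predict(features, targets, known, query, k=3):
    """Average the targets of the k known rows nearest to row `query`."""
    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(features[i], features[query]))

    nearest = sorted(known, key=squared_distance)[:k]
    return sum(targets[i] for i in nearest) / len(nearest)


def evaluate(missing_ratio, seed=0, n_rows=200):
    """Hide a share of the targets, predict them with kNN, return R²."""
    rng = random.Random(seed)
    # Synthetic stand-in data (assumption): two input features and one
    # target that depends on them, plus a little noise.
    features = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(n_rows)]
    targets = [2.0 * x + 3.0 * y + rng.gauss(0, 0.05) for x, y in features]

    hidden = sorted(mask_rows(n_rows, missing_ratio, rng))
    hidden_set = set(hidden)
    known = [i for i in range(n_rows) if i not in hidden_set]

    observed = [targets[i] for i in hidden]
    predicted = [knn_predict(features, targets, known, i) for i in hidden]
    return r_squared(observed, predicted)


if __name__ == "__main__":
    # Same missing ratios as those listed in the abstract.
    for ratio in (0.2, 0.4, 0.6, 0.8):
        print(f"missing ratio {ratio:.0%}: R2 = {evaluate(ratio):.3f}")
```

Scoring only the hidden rows keeps the comparison between prediction and observation restricted to values the model never saw, which is what the missing-ratio comparison in the abstract relies on.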
Pages: 10