Retrieving ground-level PM2.5 concentrations in China (2013-2021) with a numerical-model-informed testbed to mitigate sample-imbalance-induced biases

Cited by: 2

Authors:

Li, Siwei ^{[1,3,4]}

Ding, Yu ^{[1]}

Xing, Jia ^{[2]}

Fu, Joshua S. ^{[2]}

Affiliations:

[1] Wuhan Univ, Sch Remote Sensing & Informat Engn, Hubei Key Lab Quantitat Remote Sensing Land & Atmo, Wuhan 430000, Hubei, Peoples R China

[2] Univ Tennessee, Dept Civil & Environm Engn, Knoxville, TN 37996 USA

[3] Wuhan Univ, State Key Lab Informat Engn Surveying Mapping & Re, Wuhan 430079, Peoples R China

[4] Wuhan Univ, Hubei Luojia Lab, Wuhan 430079, Peoples R China

Source:

EARTH SYSTEM SCIENCE DATA | 2024 / Vol. 16 / Issue 08

Funding:

National Natural Science Foundation of China;

Keywords:

RANDOM FOREST; MULTISCALE; POLLUTION; GASES;

DOI:

10.5194/essd-16-3781-2024

Chinese Library Classification (CLC) number:

P [Astronomy, Earth Sciences];

Discipline classification code:

07 ;

Abstract:

Ground-level PM2.5 data derived from satellites with machine learning are crucial for health and climate assessments. However, uncertainties persist due to the absence of spatially covered observations. To address this, we propose a novel testbed using nontraditional numerical simulations to evaluate PM2.5 estimation across the entire spatial domain. The testbed emulates the general machine-learning approach by training the model with grids corresponding to ground monitoring sites and subsequently testing its predictive accuracy for other locations. Our approach enables comprehensive evaluation of various machine-learning methods' performance in estimating PM2.5 across the spatial domain for the first time. Unexpected results are shown in the application in China, with larger absolute PM2.5 biases found in densely populated regions with abundant ground observations across all benchmark models due to the higher baseline concentration, though the relative error (approximately 20 %) is smaller compared to that in rural areas (over 50 %). The imbalance in training samples, mostly from urban areas with high emissions, is the main reason, leading to significant overestimation due to the lack of monitors in downwind areas where PM2.5 is transported from urban areas with varying vertical profiles. Our proposed testbed also provides an efficient strategy for optimizing model structure or training samples to enhance satellite-retrieval model performance. Integration of spatiotemporal features, especially with convolutional neural network (CNN)-based deep-learning approaches like the residual neural network (ResNet) model, has successfully mitigated PM2.5 overestimation (by 5-30 μg m⁻³) and the corresponding exposure (by 3 million people · μg m⁻³) in the downwind area over 9 years (2013-2021) compared to the traditional approach.
Furthermore, the incorporation of 600 strategically positioned ground monitoring sites identified through the testbed is essential for achieving a more balanced distribution of training samples, thereby ensuring precise PM2.5 estimation and facilitating the assessment of the associated impacts in China. In addition to presenting the retrieved surface PM2.5 concentrations in China from 2013 to 2021, this study provides a testbed dataset derived from physical modeling simulations which can serve to evaluate the performance of data-driven methodologies, such as machine learning, in estimating spatial PM2.5 concentrations for the community (Li et al., 2024a; https://doi.org/10.5281/zenodo.11122294).
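
The train-at-sites, test-everywhere idea behind the testbed can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field is synthetic rather than the paper's numerical simulation, the predictors `emis` and `aod` are hypothetical stand-ins, the function names `make_testbed` and `evaluate` are invented here, and an ordinary least-squares fit replaces the random-forest and ResNet models named in the abstract. It shows only the evaluation pattern, not the authors' implementation or their results.

```python
import numpy as np


def make_testbed(n=40, n_sites=60, seed=0):
    """Build a synthetic 'simulated truth' field and an urban-biased site mask."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:n, 0:n]
    # Hypothetical emission hotspot standing in for an urban area.
    emis = np.exp(-((xx - n / 3) ** 2 + (yy - n / 2) ** 2) / (2 * (n / 8) ** 2))
    # Hypothetical column-aerosol predictor, partly shifted "downwind".
    aod = 0.4 * emis + 0.6 * np.roll(emis, shift=n // 6, axis=1)
    # Pseudo-simulated surface PM2.5: deliberately nonlinear in the predictors.
    pm = 10 + 30 * emis + 80 * emis * aod + rng.normal(0, 1, (n, n))
    # Place monitoring sites with probability proportional to emissions,
    # which mimics the urban-heavy sampling described in the abstract.
    weights = emis.ravel() + 0.01
    idx = rng.choice(n * n, size=n_sites, replace=False, p=weights / weights.sum())
    site = np.zeros(n * n, dtype=bool)
    site[idx] = True
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(n * n), emis.ravel(), aod.ravel()])
    return X, pm.ravel(), site


def evaluate(X, y, site):
    """Fit on site cells only, then score site and non-site cells separately."""
    coef, *_ = np.linalg.lstsq(X[site], y[site], rcond=None)
    err = X @ coef - y

    def summary(mask):
        return {
            "n": int(mask.sum()),
            "mean_bias": float(err[mask].mean()),
            "rel_err": float(np.abs(err[mask]).mean() / y[mask].mean()),
        }

    return {"sites": summary(site), "non_sites": summary(~site)}


if __name__ == "__main__":
    X, y, site = make_testbed()
    for name, stats in evaluate(X, y, site).items():
        print(name, stats)
```

Because the simulated field is known at every grid cell, the `non_sites` summary measures error where no real monitor exists, which is the quantity that site-based cross-validation cannot provide. Swapping in a different model or a different site mask and rerunning `evaluate` is the kind of comparison the testbed is meant to support.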

Citation

Pages: 3781-3793

Page count: 13