Imputing environmental impact missing data of the industrial sector for Chinese cities: A machine learning approach

Cited by: 20
Authors
Chen, Xi [1]
Shuai, Chenyang [2,3,4]
Zhao, Bu [3,4]
Zhang, Yu [5]
Li, Kaijian [2]
Affiliations
[1] Southwest Univ, Coll Econ & Management, Chongqing, Peoples R China
[2] Chongqing Univ, Sch Management Sci & Real Estate, Chongqing, Peoples R China
[3] Univ Michigan, Sch Environm & Sustainabil, Ann Arbor, MI USA
[4] Univ Michigan, Michigan Inst Computat Discovery & Engn, Ann Arbor, MI USA
[5] Chongqing Jiaotong Univ, Sch Econ & Management, Chongqing, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Industrial consumption and pollution; Environmental management; Data-driven approach; Missing data; Machine learning; IMPUTATION;
DOI
10.1016/j.eiar.2023.107050
Chinese Library Classification (CLC) number
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
Data are the lifeblood of evidence-based decision-making and the raw material for accountability. Collecting data to regularly evaluate industrial consumption and pollution at the city level is not an easy task: it requires a significant investment of institutional and financial resources and engagement with a vast number of local governments. Although the Chinese government has put extensive human and financial resources into data collection, substantial data gaps remain. This study compared two traditional linear models and four machine learning models for computationally estimating missing data for six industrial consumption and pollution indicators (responses) of 701 cities from 2006 to 2018, using ten predictors. Results showed that a decision-tree-based extreme gradient boosting model performed best among the six models. Across the six responses, the median coefficient of determination (R²) ranged from 0.85 to 0.94 and the median root mean squared error ranged from 8.5 to 17,776. This study provides high-quality, detailed data for industrial environmental analysis of Chinese cities. In addition, given its good generalization ability, the extreme gradient boosting model could be adapted to impute missing data for other environmental variables, for other sectors, and at even smaller scales.
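The workflow summarized in the abstract (fit a model on observed records, check it with R² and root mean squared error, then fill the gaps with predictions) can be illustrated with a minimal sketch. It is not the authors' pipeline: a single-predictor ordinary least squares fit stands in for the extreme gradient boosting model so that the example needs no external libraries, and the toy data, the hold-out split, and all function names are hypothetical.

```python
from math import sqrt


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y = a + b * x.

    Assumption: this simple linear model replaces the gradient boosting
    model named in the abstract, purely to keep the sketch dependency-free.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def impute(rows, a, b):
    """Replace missing responses (None) with model predictions."""
    return [(x, y if y is not None else a + b * x) for x, y in rows]


# Hypothetical toy records: (predictor, response); None marks a missing response.
rows = [
    (1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, None), (5.0, 9.8),
    (6.0, 12.1), (7.0, None), (8.0, 16.2), (9.0, 17.9), (10.0, 20.0),
]

# Only records with an observed response can be used for fitting and scoring.
observed = [(x, y) for x, y in rows if y is not None]
train, test = observed[:5], observed[5:]

# Fit on the training part, then score on the held-out observed records.
a, b = fit_line([x for x, _ in train], [y for _, y in train])
test_true = [y for _, y in test]
test_pred = [a + b * x for x, _ in test]
holdout_r2 = r2_score(test_true, test_pred)
holdout_rmse = rmse(test_true, test_pred)

# Fill the gaps; observed responses are left untouched.
filled = impute(rows, a, b)
```

Scoring on held-out observed records, rather than on the training records, is what makes R² and RMSE informative about how well the imputed values might match the truth. A tree-based model would replace `fit_line` and the prediction expression without changing the rest of the flow.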
Pages: 9