Methods for imputation of missing values in air quality data sets

Cited by: 415
Authors
Junninen, H
Niska, H
Tuppurainen, K
Ruuskanen, J
Kolehmainen, M
Affiliations
[1] Univ Kuopio, Dept Environm Sci, FIN-70211 Kuopio, Finland
[2] Commiss European Communities, Inst Environm & Sustainabil, I-21020 Ispra, Italy
[3] Univ Kuopio, Dept Chem, FIN-70211 Kuopio, Finland
Keywords
missing data; air quality; multivariate; imputing; neural networks
DOI
10.1016/j.atmosenv.2004.02.026
Chinese Library Classification
X [Environmental science, safety science]
Discipline classification code
08; 0830
Abstract
Methods for data imputation applicable to air quality data sets were evaluated using simulated missing data patterns. The methods comprised univariate methods (linear, spline and nearest neighbour interpolation), multivariate methods (regression-based imputation (REGEM), nearest neighbour (NN), self-organizing map (SOM) and multi-layer perceptron (MLP)), and hybrids of these. Additionally, a multiple imputation procedure was considered in order to compare single and multiple imputation schemes. Four statistical criteria were adopted: the index of agreement, the squared correlation coefficient (R²), the root mean square error and the mean absolute error with bootstrapped standard errors. The results showed that the performance of interpolation with respect to gap length could be estimated separately for each air quality variable by calculating a gradient and an exponent α (the Hurst exponent). This can be further utilised in a hybrid approach in which imputation is performed either by interpolation or by a multivariate method, depending on the gap length and the variable under study. Among the multivariate methods, SOM and MLP performed slightly better than REGEM and NN. The advantage of SOM over the others was that it was less dependent on the actual location of the missing values. If priority is given to computational speed, however, NN can be recommended. In general, the results showed that a slight improvement in the performance of the multivariate methods can be achieved by hybridisation, and a more substantial one by multiple imputation, where the final estimate is composed of the outputs of several multivariate fill-in methods. © 2004 Elsevier Ltd. All rights reserved.
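The evaluation scheme described in the abstract (simulate gaps, impute, score against the withheld values) can be illustrated with a minimal Python sketch. This is not the authors' code. The abstract does not give formulas, so the index of agreement is assumed here to be Willmott's d, the synthetic series and gap positions are invented for illustration, and all function names are hypothetical.

```python
import numpy as np

def index_of_agreement(obs, pred):
    """Willmott's index of agreement d (assumed definition, not stated in the abstract)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mean_obs = obs.mean()
    denom = np.sum((np.abs(pred - mean_obs) + np.abs(obs - mean_obs)) ** 2)
    return float(1.0 - np.sum((pred - obs) ** 2) / denom)

def r_squared(obs, pred):
    """Squared Pearson correlation coefficient."""
    return float(np.corrcoef(obs, pred)[0, 1] ** 2)

def rmse(obs, pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(pred, float) - np.asarray(obs, float)) ** 2)))

def mae(obs, pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(obs, float))))

def bootstrap_se(metric, obs, pred, n_boot=200, seed=0):
    """Standard error of a metric, estimated by resampling (obs, pred) pairs."""
    rng = np.random.default_rng(seed)
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    n = len(obs)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample pairs with replacement
        stats.append(metric(obs[idx], pred[idx]))
    return float(np.std(stats, ddof=1))

def linear_fill(series):
    """Univariate imputation: replace NaN values by linear interpolation."""
    series = np.asarray(series, float)
    missing = np.isnan(series)
    idx = np.arange(len(series))
    filled = series.copy()
    filled[missing] = np.interp(idx[missing], idx[~missing], series[~missing])
    return filled

# Synthetic hourly-like series (illustrative only, not real air quality data).
rng = np.random.default_rng(42)
t = np.arange(240)
truth = 30 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, t.size)

# Simulated missing data pattern: five gaps of four consecutive values each.
mask = np.zeros(t.size, dtype=bool)
for start in (20, 65, 110, 155, 200):
    mask[start:start + 4] = True
gappy = truth.copy()
gappy[mask] = np.nan

filled = linear_fill(gappy)
obs, pred = truth[mask], filled[mask]
for name, metric in [("d", index_of_agreement), ("R2", r_squared),
                     ("RMSE", rmse), ("MAE", mae)]:
    print(f"{name}: {metric(obs, pred):.3f}")
print(f"RMSE bootstrap SE: {bootstrap_se(rmse, obs, pred):.3f}")
```

Scoring only the masked positions keeps the comparison restricted to values the imputation method never saw. Swapping `linear_fill` for another fill-in function would allow the same criteria to be applied to a different method.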
Pages: 2895-2907
Page count: 13