Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation

Cited by: 314
Authors
Meyer, Hanna [1]
Reudenbach, Christoph [1]
Hengl, Tomislav [2]
Katurji, Marwan [3]
Nauss, Thomas [1]
Affiliations
[1] Philipps Univ Marburg, Fac Geog, Deutschhausstr 10, D-35037 Marburg, Germany
[2] ISRIC World Soil Informat, POB 363, NL-6700 AJ Wageningen, Netherlands
[3] Univ Canterbury, Ctr Atmospher Res, Private Bag 4800, Christchurch 8020, New Zealand
Keywords
Cross-validation; Feature selection; Over-fitting; Random forest; Spatio-temporal; Target-oriented validation; AIR-TEMPERATURE; CLASSIFICATION; INTERPOLATION; PRECIPITATION; ALGORITHMS; RETRIEVAL; PLATEAU; COVER
DOI
10.1016/j.envsoft.2017.12.001
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The importance of target-oriented validation strategies for spatio-temporal prediction models is illustrated using two case studies: (1) modelling of air temperature (T_air) in Antarctica, and (2) modelling of volumetric water content (VW) for the R.J. Cook Agronomy Farm, USA. The performance of a random k-fold cross-validation (CV) was compared to three target-oriented strategies: Leave-Location-Out (LLO), Leave-Time-Out (LTO), and Leave-Location-and-Time-Out (LLTO) CV. Results indicate considerable differences between random k-fold CV (R² = 0.9 for T_air and 0.92 for VW) and target-oriented CV (LLO R² = 0.24 for T_air and 0.49 for VW), highlighting the need for target-oriented validation to avoid an overoptimistic view of models. Differences between random k-fold and target-oriented CV indicate spatial over-fitting caused by misleading variables. To decrease over-fitting, a forward feature selection in conjunction with target-oriented CV is proposed. It decreased over-fitting and simultaneously improved target-oriented performance (LLO CV R² = 0.47 for T_air and 0.55 for VW). © 2017 Elsevier Ltd. All rights reserved.
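
The contrast the abstract draws between random k-fold CV and Leave-Location-Out CV, and the proposed forward feature selection scored by LLO CV, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration and not the authors' implementation: the data are synthetic, the predictor names (`season`, `solar`, `site_static`) are hypothetical, LLO CV is approximated with scikit-learn's `GroupKFold` over location IDs, and the selection loop is a simplified greedy variant that starts from the best pair of predictors.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold, KFold, cross_val_predict

# Synthetic spatio-temporal data (assumption): 40 locations x 24 time steps.
rng = np.random.default_rng(0)
n_locations, n_times = 40, 24
location = np.repeat(np.arange(n_locations), n_times)
time = np.tile(np.arange(n_times), n_locations)

season = np.sin(2 * np.pi * time / n_times)   # time-varying predictor
solar = np.cos(4 * np.pi * time / n_times)    # second time-varying predictor
# Static per-location predictor that carries no transferable signal but lets
# the forest memorise each site (a stand-in for a "misleading variable").
site_static = rng.uniform(0, 1, n_locations)[location]
site_effect = rng.normal(0, 2, n_locations)[location]

X = np.column_stack([season, solar, site_static])
names = ["season", "solar", "site_static"]
y = 3 * season + 2 * solar + site_effect + rng.normal(0, 0.3, location.size)


def cv_r2(X, y, cv, groups=None):
    """R^2 of the pooled out-of-fold predictions of a random forest."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    pred = cross_val_predict(model, X, y, cv=cv, groups=groups)
    return r2_score(y, pred)


def forward_feature_selection(X, y, groups, cv):
    """Greedy selection scored by grouped CV, starting from the best pair."""
    n_features = X.shape[1]
    best_score, selected = max(
        (cv_r2(X[:, list(pair)], y, cv, groups), list(pair))
        for pair in combinations(range(n_features), 2)
    )
    while True:
        remaining = [f for f in range(n_features) if f not in selected]
        trials = [
            (cv_r2(X[:, sorted(selected + [f])], y, cv, groups), f)
            for f in remaining
        ]
        if not trials:
            break
        score, feature = max(trials)
        if score <= best_score:  # stop once no candidate improves the score
            break
        best_score, selected = score, sorted(selected + [feature])
    return selected, best_score


random_cv = KFold(n_splits=5, shuffle=True, random_state=0)
llo_cv = GroupKFold(n_splits=5)  # whole locations are held out together

random_r2 = cv_r2(X, y, random_cv)
llo_r2 = cv_r2(X, y, llo_cv, groups=location)
selected, ffs_r2 = forward_feature_selection(X, y, location, llo_cv)

print(f"random k-fold R^2, all predictors: {random_r2:.2f}")
print(f"LLO CV R^2, all predictors:        {llo_r2:.2f}")
print(f"LLO CV R^2 after selection:        {ffs_r2:.2f}")
print("selected predictors:", [names[i] for i in selected])
```

On data built this way, the random k-fold score should come out well above the LLO score, because held-out samples share locations with the training samples. The selection step can only keep or raise the LLO score relative to the full predictor set, since the full set is the last candidate it evaluates.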
Pages: 1-9
Number of pages: 9
Related papers
40 in total
[1]   Evaluating machine learning approaches for the interpolation of monthly air temperature at Mt. Kilimanjaro, Tanzania [J].
Appelhans, Tim ;
Mwangomo, Ephraim ;
Hardy, Douglas R. ;
Hemp, Andreas ;
Nauss, Thomas .
SPATIAL STATISTICS, 2015, 14 :91-113
[2]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[3]   Detecting rock glacier flow structures using Gabor filters and IKONOS imagery [J].
Brenning, Alexander ;
Long, Shilei ;
Fieguth, Paul .
REMOTE SENSING OF ENVIRONMENT, 2012, 125 :227-237
[4]   Machine learning for predicting soil classes in three semi-arid landscapes [J].
Brungard, Colby W. ;
Boettinger, Janis L. ;
Duniway, Michael C. ;
Wills, Skye A. ;
Edwards, Thomas C., Jr. .
GEODERMA, 2015, 239 :68-83
[5]   Model-based geostatistics [J].
Diggle, PJ ;
Tawn, JA ;
Moyeed, RA .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS, 1998, 47 :299-326
[6]   Spatio-temporal interpolation of soil water, temperature, and electrical conductivity in 3D+T: The Cook Agronomy Farm data set [J].
Gasch, Caley K. ;
Hengl, Tomislav ;
Graeler, Benedikt ;
Meyer, Hanna ;
Magney, Troy S. ;
Brown, David J. .
SPATIAL STATISTICS, 2015, 14 :70-90
[7]   A comparison of selected classification algorithms for mapping bamboo patches in lower Gangetic plains using very high resolution WorldView 2 imagery [J].
Ghosh, Aniruddha ;
Joshi, P. K. .
INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2014, 26 :298-311
[8]   Random Forests for land cover classification [J].
Gislason, PO ;
Benediktsson, JA ;
Sveinsson, JR .
PATTERN RECOGNITION LETTERS, 2006, 27 (04) :294-300
[9]   A Machine Learning Based Spatio-Temporal Data Mining Approach for Detection of Harmful Algal Blooms in the Gulf of Mexico [J].
Gokaraju, Balakrishna ;
Durbha, Surya S. ;
King, Roger L. ;
Younan, Nicolas H. .
IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2011, 4 (03) :710-720
[10]   Towards observation-based gridded runoff estimates for Europe [J].
Gudmundsson, L. ;
Seneviratne, S. I. .
HYDROLOGY AND EARTH SYSTEM SCIENCES, 2015, 19 (06) :2859-2879