Neglecting spatial autocorrelation causes underestimation of the error of sugarcane yield models

Cited by: 12
Authors
Ferraciolli, Matheus A. [1]
Bocca, Felipe F. [1]
Rodrigues, Luiz Henrique A. [1]
Affiliations
[1] Univ Estadual Campinas, Sch Agr Engn, Campinas, SP, Brazil
Keywords
Boosted regression trees; Random forest; Support vector regression; RReliefF; Spatial clustering; TEMPERATURE; PREDICTION; TERRA
DOI
10.1016/j.compag.2018.09.003
Chinese Library Classification
S [Agricultural Sciences]
Discipline classification code
09
Abstract
With the increased application of information technology in agriculture, data are being produced and used on an unprecedented scale. While these advances, combined with machine learning techniques, have benefited yield modeling, most of the current literature on data-driven yield modeling has not yet accounted for potential sources of correlation in the data and assumes independence between samples. In this scenario, random sampling can place correlated samples in different sets used for model evaluation. We implemented a spatially aware protocol and compared it with the naive approach of assuming independence between samples. The protocols were applied throughout the model development pipeline: data splitting for hold-out sets, feature selection, cross-validation for model adjustment, and model evaluation. Three different machine learning techniques were used to create models under each protocol. The resulting models were evaluated both on the validation set created by each protocol and on a manually created independent set. This independent set ensured that there was no autocorrelation between the samples used for modeling. We showed that assuming independence when modeling yield leads to underestimated model errors and to overfitting during model adjustment. Despite better error tracking, the model with the smallest error on the test set was not the model with the smallest validation error, suggesting overfitting in model selection. While this effect was small for the spatially aware protocol, it was much stronger in the naive protocol. Future efforts in yield modeling should address the effect of spatial autocorrelation and other potential sources of correlation to improve the correctness and robustness of the results.
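The contrast between the two data-splitting protocols described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how spatial clusters were formed or how sets were sized, so the cluster assignment, the 25% hold-out fraction, and the function names `random_split`, `spatial_split`, and `shared_clusters` are all assumptions made for illustration. The sketch only shows why a random split lets the same spatial cluster appear in both sets, whereas holding out whole clusters does not.

```python
import random


def random_split(n_samples, test_fraction, seed=0):
    """Naive protocol (assumed form): shuffle sample indices, ignoring location."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def spatial_split(cluster_ids, test_fraction, seed=0):
    """Spatially aware protocol (assumed form): hold out whole clusters,
    so no cluster contributes samples to both sets."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)
    n_test_clusters = max(1, int(round(len(clusters) * test_fraction)))
    test_clusters = set(clusters[:n_test_clusters])
    train = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    test = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    return train, test


def shared_clusters(cluster_ids, train, test):
    """Clusters present on both sides of a split (potential leakage)."""
    return {cluster_ids[i] for i in train} & {cluster_ids[i] for i in test}


# Hypothetical data: 200 fields assigned to 20 spatial clusters of 10 fields each.
cluster_ids = [i // 10 for i in range(200)]

naive_train, naive_test = random_split(len(cluster_ids), 0.25)
aware_train, aware_test = spatial_split(cluster_ids, 0.25)

naive_overlap = shared_clusters(cluster_ids, naive_train, naive_test)
aware_overlap = shared_clusters(cluster_ids, aware_train, aware_test)

print("clusters shared by train and test (naive):", len(naive_overlap))
print("clusters shared by train and test (spatially aware):", len(aware_overlap))
```

In the naive split, neighbouring fields from one cluster typically land on both sides, so validation error is measured on samples correlated with the training data. The same grouping idea would apply to cross-validation folds and feature selection, which the abstract says were also handled per protocol.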
Pages: 233-240
Number of pages: 8