The spatial leave-pair-out cross-validation method for reliable AUC estimation of spatial classifiers

Cited by: 14
Authors
Airola, Antti [1]
Pohjankukka, Jonne [1]
Torppa, Johanna [2]
Middleton, Maarit [3]
Nykanen, Vesa [3]
Heikkonen, Jukka [1]
Pahikkala, Tapio [1]
Affiliations
[1] Univ Turku, Dept Future Technol, Turku 20014, Finland
[2] Geol Survey Finland, Neulaniementie 5, POB 1237, FIN-70211 Kuopio, Finland
[3] Geol Survey Finland, Lahteentie 2, POB 77, FIN-96101 Rovaniemi, Finland
Keywords
Area under ROC curve; Classifier evaluation; Cross-validation; Mineral prospectivity mapping; Spatial data mining; MINERAL PROSPECTIVITY; GREENSTONE-BELT; NEURAL-NETWORKS; OROGENIC GOLD; PREDICTION; MODELS; MACHINE; AREA; ROC; CLASSIFICATION
DOI
10.1007/s10618-018-00607-x
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
Machine-learning-based classification methods are widely used in geoscience applications, including mineral prospectivity mapping. Typical characteristics of the data, such as a small number of positive instances, imbalanced class distributions and a lack of verified negative instances, make ROC analysis and cross-validation natural choices for classifier evaluation. However, recent literature has identified two sources of bias that can affect the reliability of area under the ROC curve (AUC) estimation via cross-validation on spatial data. The pooling procedure performed by methods such as leave-one-out can introduce a substantial negative bias to the results. At the same time, spatial dependencies leading to spatial autocorrelation can produce overoptimistic results if they are not corrected for. In this work, we introduce the spatial leave-pair-out cross-validation method, which corrects for both of these biases simultaneously. The methodology is used to benchmark a number of classification methods on mineral prospectivity mapping data from the Central Lapland greenstone belt. The evaluation highlights the danger of obtaining misleading results on spatial data and demonstrates how these problems can be avoided. Further, the results show the advantages of simple linear models for this classification task.
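The abstract names the two corrections but not their mechanics, so the following is only a minimal sketch of how a spatial leave-pair-out AUC estimate could be computed. It rests on assumptions that go beyond the text above: each positive-negative pair is held out in turn, every training point within a "dead zone" `radius` of either held-out point is discarded, and the AUC is the fraction of pairs ranked correctly, with ties counted as 0.5. The function names, the `radius` parameter, and the closed-form ridge regression used as a stand-in classifier are all hypothetical and are not taken from the paper.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression with a bias column (stand-in model)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def ridge_predict(w, X):
    """Real-valued scores for the rows of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def spatial_leave_pair_out_auc(X, y, coords, radius, lam=1.0):
    """Sketch of a spatial leave-pair-out AUC estimate (assumed procedure).

    X: (n, d) features, y: (n,) labels in {0, 1}, coords: (n, 2) locations.
    radius: assumed dead-zone radius around each held-out point.
    """
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    total, n_pairs = 0.0, 0
    for i in pos:
        for j in neg:
            # Distance of every point to each held-out point.
            d_i = np.linalg.norm(coords - coords[i], axis=1)
            d_j = np.linalg.norm(coords - coords[j], axis=1)
            # Keep only training points strictly outside both dead zones;
            # the held-out points themselves have distance 0 and are dropped.
            keep = (d_i > radius) & (d_j > radius)
            # Assumption: skip pairs whose training set lost a whole class.
            if np.unique(y[keep]).size < 2:
                continue
            targets = 2.0 * y[keep].astype(float) - 1.0
            w = ridge_fit(X[keep], targets, lam)
            s_i, s_j = ridge_predict(w, X[[i, j]])
            # Score each pair on its own, so predictions from different
            # models are never pooled into a single ranking.
            total += 1.0 if s_i > s_j else (0.5 if s_i == s_j else 0.0)
            n_pairs += 1
    return total / n_pairs if n_pairs else float("nan")
```

Under these assumptions, setting `radius=0.0` reduces the sketch to plain leave-pair-out, and a larger `radius` trades training data for weaker spatial dependence between the training set and the held-out pair.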
Pages: 730-747
Number of pages: 18