On the relationship between training sample size and data dimensionality: Monte Carlo analysis of broadband multi-temporal classification

Cited by: 150
Authors
Van Niel, TG
McVicar, TR
Datt, B
Affiliations
[1] CSIRO Land & Water, Wembley, WA 6913, Australia
[2] CSIRO Land & Water, Canberra, ACT 2601, Australia
[3] CSIRO, Earth Observat Ctr, Canberra, ACT 2601, Australia
[4] Cooperat Res Ctr Sustainable Rice Prod, Yanco, NSW 2703, Australia
Keywords
crop classification; dimensionality; training sample; time-series; multi-temporal; maximum likelihood;
DOI
10.1016/j.rse.2005.08.011
Chinese Library Classification number
X [Environmental Science, Safety Science];
Discipline classification code
08 ; 0830 ;
Abstract
The number of training samples per class (n) required for accurate Maximum Likelihood (ML) classification is known to be affected by the number of bands (p) in the input image. However, the general rule which defines that n should be 10p to 30p is often enforced universally in remote sensing without questioning its relevance to the complexity of the specific discrimination problem. Furthermore, identifying this many training samples is often problematic when many classes and/or many bands are used. It is important, then, to test how this generally accepted rule matches common remote sensing discrimination problems because it could be unnecessarily restrictive for many applications. This study was primarily conducted in order to test whether the general rule defining the relationship between n and p was well-suited for ML classification of a relatively simple remote sensing-based discrimination problem. To summarise the mean response of n-to-p for our study site, a Monte Carlo procedure was used to randomly stack various numbers of bands into thousands of separate image combinations that were then classified using an ML algorithm. The bands were randomly selected from a 119-band Enhanced Thematic Mapper-plus (ETM+) dataset comprised of 17 images acquired during the 2001-2002 southern hemisphere summer agricultural growing season over an irrigation area in south-eastern Australia. Results showed that the number of training samples needed for accurate ML classification was much lower than the current widely accepted rule. Due to the asymptotic nature of the relationship, we found that 95% of the accuracy attained using n = 30p samples could be achieved by using approximately 2p to 4p samples, or <= 1/7th the currently recommended value of n. Our findings show that the number of training samples needed for a simple discrimination problem is much less than that defined by the general rule and therefore the rule should not be universally enforced; the number of training samples needed should also be determined by considering the complexity of the discrimination problem. © 2005 Elsevier Inc. All rights reserved.
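The Monte Carlo procedure summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' code and uses no ETM+ data: it assumes synthetic Gaussian classes, a plain Gaussian ML classifier written with NumPy, and a small ridge term on each covariance matrix. All function names, the band count (20), the class count (3) and the trial count are hypothetical choices made for illustration only.

```python
import numpy as np

def make_synthetic(rng, means, n_per_class):
    """Draw unit-variance Gaussian samples around each class mean (synthetic stand-in for image bands)."""
    n_classes, total_bands = means.shape
    X = np.vstack([rng.normal(means[c], 1.0, size=(n_per_class, total_bands))
                   for c in range(n_classes)])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

def fit_gaussian_ml(X, y):
    """Estimate a mean vector and covariance matrix per class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Ridge term is an assumption; it keeps the covariance invertible for small n.
        cov = np.atleast_2d(np.cov(Xc, rowvar=False)) + 1e-6 * np.eye(X.shape[1])
        params[int(c)] = (Xc.mean(axis=0), cov)
    return params

def predict_gaussian_ml(params, X):
    """Assign each sample to the class with the highest Gaussian log-likelihood."""
    classes = sorted(params)
    scores = []
    for c in classes:
        mean, cov = params[c]
        diff = X - mean
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        scores.append(-0.5 * (logdet + maha))
    return np.array(classes)[np.argmax(np.stack(scores, axis=1), axis=1)]

def monte_carlo_accuracy(X_train, y_train, X_test, y_test, p, n_per_class, trials, rng):
    """Mean test accuracy over random p-band subsets and random training draws."""
    accs = []
    classes = np.unique(y_train)
    for _ in range(trials):
        bands = rng.choice(X_train.shape[1], size=p, replace=False)
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y_train == c), size=n_per_class, replace=False)
            for c in classes
        ])
        params = fit_gaussian_ml(X_train[np.ix_(idx, bands)], y_train[idx])
        pred = predict_gaussian_ml(params, X_test[:, bands])
        accs.append(np.mean(pred == y_test))
    return float(np.mean(accs))

rng = np.random.default_rng(0)
means = rng.normal(0.0, 1.0, size=(3, 20))      # 3 classes, 20 synthetic bands
X_train, y_train = make_synthetic(rng, means, 200)
X_test, y_test = make_synthetic(rng, means, 200)

p = 5
results = {}
for k in (2, 4, 10, 30):                         # n = k * p training samples per class
    results[k] = monte_carlo_accuracy(X_train, y_train, X_test, y_test,
                                      p=p, n_per_class=k * p, trials=50, rng=rng)
    print(f"n = {k}p: mean accuracy = {results[k]:.3f}")
```

Comparing `results[2]` and `results[4]` against `results[30]` mirrors the abstract's comparison of 2p to 4p samples with the n = 30p baseline, although the numbers from synthetic data say nothing about the study's actual findings.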
Pages: 468-480
Number of pages: 13