Generating information for small data sets with a multi-modal distribution

Cited by: 21
Authors
Li, Der-Chiang [1]
Lin, Liang-Sian [1]
Affiliations
[1] Natl Cheng Kung Univ, Dept Ind & Informat Management, Tainan 70101, Taiwan
Keywords
Multi-modal distribution; Small data set; Multi-modal virtual sample; Virtual sample size; Weibull distribution; Trend-diffusion; Classification; Algorithm
DOI
10.1016/j.dss.2014.06.004
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Virtual sample generation approaches have been used with small data sets to enhance classification performance in a number of reports. The appropriate estimation of data distribution plays an important role in this process, with performance usually better for data sets that have a simple distribution rather than a complex one. Mixed-type data sets often have a multi-modal distribution instead of a simple, uni-modal one. This study thus proposes a new approach to detect multi-modality in data sets, to avoid the problem of inappropriately using a uni-modal distribution. We utilize the common k-means clustering method to detect possible clusters, and, based on the clustered sample sets, a Weibull variate is developed for each of these to produce multi-modal virtual data. In this approach, the degree of error variation in the Weibull skewness between the original and virtual data is measured and used as the criterion for determining the sizes of virtual samples. Six data sets with different training data sizes are employed to check the performance of the proposed method, and comparisons are made based on the classification accuracies. The results using non-parametric testing show that the proposed method achieves better classification performance than the recently presented Mega-Trend-Diffusion method. © 2014 Elsevier B.V. All rights reserved.
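The pipeline outlined in the abstract (cluster the small sample, fit a Weibull distribution to each cluster, then draw virtual samples per cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the one-dimensional data, the two-parameter maximum-likelihood fit, the quantile-based k-means initialization, and the fixed `per_cluster` count are all assumptions made for illustration; in particular, the skewness-based rule the abstract describes for choosing the virtual sample size is not reproduced here.

```python
import math
import random


def kmeans_1d(values, k, iters=100):
    """Lloyd's k-means on scalars, seeded at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return [c for c in clusters if c]  # drop empty clusters


def fit_weibull(sample):
    """Two-parameter Weibull MLE (shape, scale); data must be positive."""
    top = max(sample)
    ys = [x / top for x in sample]  # rescale so y ** k cannot overflow
    mean_log = sum(math.log(y) for y in ys) / len(ys)

    def score(k):
        # Likelihood equation for the shape; increasing in k.
        weights = [y ** k for y in ys]
        weighted_log = sum(w * math.log(y) for w, y in zip(weights, ys))
        return weighted_log / sum(weights) - 1.0 / k - mean_log

    lo, hi = 1e-3, 1.0
    while score(hi) < 0 and hi < 1e4:  # grow the bracket until the root is inside
        hi *= 2.0
    for _ in range(80):                # bisection on the shape parameter
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = top * (sum(y ** shape for y in ys) / len(ys)) ** (1.0 / shape)
    return shape, scale


def virtual_samples(values, k=2, per_cluster=20, seed=0):
    """Draw per_cluster Weibull variates for each detected cluster."""
    rng = random.Random(seed)
    out = []
    for cluster in kmeans_1d(values, k):
        shape, scale = fit_weibull(cluster)
        # random.weibullvariate takes (scale, shape) in that order.
        out.extend(rng.weibullvariate(scale, shape) for _ in range(per_cluster))
    return out


# Hypothetical bimodal small sample: two well-separated positive modes.
_rng = random.Random(1)
data = ([_rng.gauss(5, 0.5) for _ in range(15)]
        + [_rng.gauss(20, 1.0) for _ in range(15)])
virtual = virtual_samples(data, k=2, per_cluster=20)
```

Fitting one Weibull per cluster rather than one to the pooled sample is what lets the virtual data keep both modes; a single uni-modal fit would place much of its mass between them.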
Pages: 71-81
Number of pages: 11