Selection of the number of clusters in functional data analysis

Cited by: 2
Authors
Zambom, Adriano Zanin [1]
Alfonso Collazos, Julian [2]
Dias, Ronaldo [3]
Affiliations
[1] Calif State Univ Northridge, Dept Math, 18111 Nordhoff St, Northridge, CA 91330 USA
[2] New Granada Mil Univ, Dept Math, Bogotá, Colombia
[3] State Univ Campinas UNICAMP, Dept Stat, Sao Paulo, SP, Brazil
Funding
São Paulo Research Foundation (FAPESP), Brazil;
Keywords
Parallelism; test statistic; K-means algorithm; ANOVA; clustering; DATA SET; MODEL; ALGORITHMS;
DOI
10.1080/00949655.2022.2053855
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Identifying the number K of clusters in a dataset is one of the most difficult problems in clustering analysis. A choice of K that correctly characterizes the features of the data is essential for building meaningful clusters. In this paper we tackle the problem of estimating the number of clusters in functional data analysis by introducing a new measure that can be used with different procedures in selecting the optimal K. The main idea is to use a combination of two test statistics, which measure the lack of parallelism and the mean distance between curves, to compute criteria such as the within and between cluster sum of squares. Simulations in challenging scenarios suggest that procedures using this measure can detect the correct number of clusters more frequently than existing methods in the literature. The application of the proposed method is illustrated on several real datasets.
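The abstract does not give the formulas for the two test statistics, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes curves sampled on a common grid and uses two hypothetical stand-ins: the variance of the pointwise difference between two curves (zero when the curves are parallel) for lack of parallelism, and the squared mean of that difference for the mean distance. The names pair_dissimilarity, within_cluster_sum and weight are illustrative and do not come from the paper; the cluster labels are taken as given and could come from any clustering algorithm, such as K-means.

```python
import numpy as np


def pair_dissimilarity(x, y, weight=0.5):
    """Hypothetical dissimilarity between two curves on a common grid.

    With d = x - y, parallel curves have a constant d, so var(d) is a
    stand-in for lack of parallelism and mean(d)**2 is a stand-in for the
    mean distance. Neither is the test statistic used in the paper.
    """
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    lack_of_parallelism = d.var()
    mean_distance = d.mean() ** 2
    return weight * lack_of_parallelism + (1.0 - weight) * mean_distance


def within_cluster_sum(curves, labels, weight=0.5):
    """Sum of dissimilarities between each curve and its cluster mean curve."""
    curves = np.asarray(curves, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = curves[labels == k]
        center = members.mean(axis=0)  # pointwise mean curve of the cluster
        total += sum(pair_dissimilarity(c, center, weight) for c in members)
    return float(total)


# Toy data: two groups of curves; curves within a group are vertical
# shifts of one another, so they are parallel.
t = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(0)
group_a = np.sin(2 * np.pi * t) + rng.normal(0.0, 0.1, size=(10, 1))
group_b = np.cos(2 * np.pi * t) + rng.normal(0.0, 0.1, size=(10, 1))
curves = np.vstack([group_a, group_b])

one_cluster = np.zeros(20, dtype=int)
two_clusters = np.repeat([0, 1], 10)

# Compare the criterion for K = 1 and K = 2 on the toy data.
print(within_cluster_sum(curves, one_cluster))
print(within_cluster_sum(curves, two_clusters))
```

With weight=0.5 this dissimilarity reduces to half the mean squared difference between the curves, so the criterion behaves like an ordinary within-cluster sum of squares; other weights shift the emphasis between shape (parallelism) and level (mean distance). A criterion of this kind would then be evaluated over a range of K and passed to whichever selection procedure is being used.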
Pages: 2980-2998
Number of pages: 19