Model-based clustering of Gaussian copulas for mixed data

Cited by: 24
Authors
Marbac, Matthieu [1, 2]
Biernacki, Christophe [1, 2, 3]
Vandewalle, Vincent [1, 4]
Affiliations
[1] Inria Lille, 40 Ave Halley, F-59650 Villeneuve d'Ascq, France
[2] Univ Lille 1, Villeneuve d'Ascq, France
[3] CNRS, Paris, France
[4] Univ Lille 2, EA 2694, Lille, France
Keywords
Clustering; Gaussian copula; Metropolis-within-Gibbs algorithm; mixed data; mixture models; visualization; MIXTURE MODEL; BAYESIAN-INFERENCE; LIKELIHOOD; VARIABLES; MARGINS
DOI
10.1080/03610926.2016.1277753
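Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714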
Abstract
Clustering of mixed data is important yet challenging due to a shortage of conventional distributions for such data. In this article, we propose a mixture model of Gaussian copulas for clustering mixed data. Indeed, copulas, and Gaussian copulas in particular, are powerful tools for easily modeling the distribution of multivariate variables. This model clusters data sets with continuous, integer, and ordinal variables (all having a cumulative distribution function) by considering the intra-component dependencies in a similar way to the Gaussian mixture. Indeed, each component of the Gaussian copula mixture produces a correlation coefficient for each pair of variables, and its univariate margins follow standard distributions (Gaussian, Poisson, and ordered multinomial) depending on the nature of the variable (continuous, integer, or ordinal). As an interesting by-product, this model generalizes many well-known approaches and provides tools for visualization based on its parameters. Bayesian inference is achieved with a Metropolis-within-Gibbs sampler. The numerical experiments, on simulated and real data, illustrate the benefits of the proposed model: flexible and meaningful parameterization combined with visualization features.
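The generative structure described in the abstract can be illustrated with a minimal simulation sketch. It assumes one continuous, one integer, and one ordinal variable per observation; the function names (`sample_component`, `sample_mixture`) and all parameter names and values are hypothetical and not taken from the article. The sketch only draws data from a Gaussian copula mixture; it does not implement the Metropolis-within-Gibbs inference.

```python
import numpy as np
from scipy.stats import norm, poisson


def sample_component(n, corr, mu, sigma, lam, ord_probs, rng):
    """Draw n rows (continuous, integer, ordinal) from one copula component.

    All parameter names are illustrative assumptions, not the paper's notation.
    """
    # Latent Gaussian vector whose correlation matrix carries the dependence.
    z = rng.multivariate_normal(np.zeros(3), corr, size=n)
    # Map to uniform scores; clip to avoid infinite quantiles at exactly 0 or 1.
    u = np.clip(norm.cdf(z), 1e-12, 1 - 1e-12)
    # Gaussian margin for the continuous variable.
    x_cont = norm.ppf(u[:, 0], loc=mu, scale=sigma)
    # Poisson margin for the integer variable (generalized inverse CDF).
    x_int = poisson.ppf(u[:, 1], lam).astype(int)
    # Ordered multinomial margin: invert the cumulative level probabilities.
    cum = np.cumsum(ord_probs)
    x_ord = np.minimum(np.searchsorted(cum, u[:, 2], side="left"),
                       len(ord_probs) - 1)
    return x_cont, x_int, x_ord


def sample_mixture(n, weights, components, seed=0):
    """Draw n observations and their component labels from the mixture."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(weights), size=n, p=weights)
    data = np.zeros((n, 3))
    for k, params in enumerate(components):
        idx = labels == k
        data[idx] = np.column_stack(
            sample_component(int(idx.sum()), rng=rng, **params))
    return data, labels


# Hypothetical two-component example.
components = [
    dict(corr=np.array([[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]),
         mu=0.0, sigma=1.0, lam=5.0, ord_probs=[0.2, 0.5, 0.3]),
    dict(corr=np.eye(3), mu=4.0, sigma=0.5, lam=1.0,
         ord_probs=[0.6, 0.3, 0.1]),
]
data, labels = sample_mixture(2000, [0.5, 0.5], components)
```

Each row of `data` holds the continuous, integer, and ordinal values in that order. Within the first component the continuous and integer variables are positively dependent because their latent correlation is set to 0.8, while the second component has independent variables.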
Pages: 11635-11656
Number of pages: 22