Gaussian Mixture Models in R

Cited by: 0
Authors
Chassagnol, Bastien [1]
Bichat, Antoine [2]
Boudjeniba, Cheima [3]
Wuillemin, Pierre-Henri [4]
Guedj, Mickael [2]
Gohel, David [5]
Nuel, Gregory [1]
Becht, Etienne [2]
Affiliations
[1] Sorbonne Univ, Lab Probabil Stat & Modelisat LPSM, UMR 8001, CNRS, 4 Pl Jussieu, F-75005 Paris, France
[2] Les Labs Servier, 50 Rue Carnot, F-92150 Suresnes, France
[3] Inst Pasteur, Dept Computat Biol, Syst Biol Grp, 25 Rue Dr Roux, F-75015 Paris, France
[4] Sorbonne Univ, Lab Informat Paris LIP6 6, UMR 7606, 4 Pl Jussieu, F-75005 Paris, France
[5] ArData, 7 Rue Voltaire, F-92800 Puteaux La Defense, France
Keywords
EM ALGORITHM; MAXIMUM-LIKELIHOOD;
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Subject classification code
081203 ; 0835 ;
Abstract
Gaussian mixture models (GMMs) are widely used for modelling stochastic problems, and a wide range of packages implementing them has been developed in R. However, no recent review describes the main features offered by these packages or compares their performance. In this article, we first introduce GMMs and the EM algorithm used to estimate the parameters of the model, and analyse the main features implemented in seven of the most widely used R packages. We then empirically compare their statistical and computational performance in relation to the choice of the initialisation algorithm and the complexity of the mixture. We demonstrate that the best estimation with well-separated components, or with a small number of components with distinguishable modes, is obtained with REBMIX initialisation, implemented in the rebmix package, while the best estimation with highly overlapping components is obtained with k-means or random initialisation. Importantly, we show that implementation details in the EM algorithm yield differences in the parameter estimates. In particular, the packages mixtools (Young et al. 2020) and Rmixmod (Langrognet et al. 2021) estimate the parameters of the mixture with smaller bias, while the RMSE and variability of the estimates are smaller with the packages bgmm (Ewa Szczurek 2021), EMCluster (W.-C. Chen and Maitra 2022), GMKMcharlie (Liu 2021), flexmix (Gruen and Leisch 2022) and mclust (Fraley, Raftery, and Scrucca 2022). The comparison of these packages provides R users with useful recommendations for improving the computational and statistical performance of their clustering and for identifying common deficiencies. Additionally, we propose several improvements for the development of a future, unified mixture model package.
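As a rough illustration of the EM algorithm the abstract refers to, the following is a minimal sketch for a univariate Gaussian mixture, written in plain Python rather than with any of the R packages reviewed. The function names (`normal_pdf`, `em_gmm_1d`), the simulated data, and the min/max starting values are assumptions made for this example only; they are not taken from the article and do not correspond to the REBMIX, k-means or random initialisations it compares.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_gmm_1d(data, weights, means, sds, max_iter=500, tol=1e-8):
    """Fit a univariate GMM by EM, starting from the supplied values."""
    # Work on copies so the caller's starting values are not modified.
    weights, means, sds = list(weights), list(means), list(sds)
    k, n = len(weights), len(data)
    prev_loglik = -math.inf
    loglik = prev_loglik
    for _ in range(max_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        loglik = 0.0
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], sds[j]) for j in range(k)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: responsibility-weighted maximum-likelihood updates.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sds[j] = math.sqrt(max(var, 1e-12))  # guard against a collapsing component
        # Stop once the log-likelihood no longer improves noticeably.
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    # loglik is the value computed in the last E-step performed.
    return weights, means, sds, loglik


# Simulated example: two well-separated components (hypothetical data).
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(300)]
sample += [random.gauss(6.0, 1.0) for _ in range(200)]

# Naive starting values: equal weights, means at the sample extremes.
fit_weights, fit_means, fit_sds, fit_loglik = em_gmm_1d(
    sample, [0.5, 0.5], [min(sample), max(sample)], [1.0, 1.0]
)
```

Because EM only converges to a local maximum of the likelihood, the result depends on the starting values passed in, which is why the choice of initialisation algorithm matters in the comparison described above.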
Pages: 56-76
Number of pages: 21
Related papers
27 in total
[1]  
[Anonymous], 2023, R Foundation for Statistical Computing
[2]   Mixtures of Factor Analyzers with Common Factor Loadings: Applications to the Clustering and Visualization of High-Dimensional Data [J].
Baek, Jangsun ;
McLachlan, Geoffrey J. ;
Flack, Lloyd K. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2010, 32 (07) :1298-1309
[3]   MODEL-BASED GAUSSIAN AND NON-GAUSSIAN CLUSTERING [J].
BANFIELD, JD ;
RAFTERY, AE .
BIOMETRICS, 1993, 49 (03) :803-821
[4]  
Berge Laurent, 2019, HDclassif: High Dimensional Supervised Classification and Clustering
[5]   Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models [J].
Biernacki, C ;
Celeux, G ;
Govaert, G .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2003, 41 (3-4) :561-575
[6]   Robust supervised classification with mixture models: Learning from data with uncertain labels [J].
Bouveyron, Charles ;
Girard, Stephane .
PATTERN RECOGNITION, 2009, 42 (11) :2649-2658
[7]   A CLASSIFICATION EM ALGORITHM FOR CLUSTERING AND 2 STOCHASTIC VERSIONS [J].
CELEUX, G ;
GOVAERT, G .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 1992, 14 (03) :315-332
[8]   Consistency of the MLE under Mixture Models [J].
Chen, Jiahua .
STATISTICAL SCIENCE, 2017, 32 (01) :47-63
[9]   The distribution of quadratic forms in a normal system, with applications to the analysis of covariance. [J].
Cochran, WG .
PROCEEDINGS OF THE CAMBRIDGE PHILOSOPHICAL SOCIETY, 1934, 30 :178-191
[10]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38