Variable Selection for Mixed Data Clustering: Application in Human Population Genomics

Cited by: 0
Authors
Matthieu Marbac
Mohammed Sedki
Etienne Patin
Affiliations
[1] Ensai, CREST
[2] University of Paris-Sud, UMR Inserm 1181
[3] Institut Pasteur
Source
Journal of Classification | 2020 / Vol. 37
Keywords
Human evolutionary genetics; Information criterion; Mixed data; Model-based clustering; Variable selection
DOI
Not available
Abstract
Model-based clustering of human population genomic data, comprising 1,318 individuals from western Central Africa and 160,470 markers, is considered. This challenging analysis leads us to develop a new methodology for variable selection in clustering. To explain the differences between subpopulations and to increase the accuracy of the estimates, variable selection is performed simultaneously with clustering. We propose two approaches for selecting variables when clustering is performed with the latent class model (i.e., a mixture model assuming independence within components). The first method simultaneously performs model selection and parameter inference: it optimizes the Bayesian Information Criterion (BIC) with a modified version of the standard expectation–maximization (EM) algorithm. The second method performs model selection without requiring parameter inference by maximizing the Maximum Integrated Complete-data Likelihood (MICL) criterion. Although the application considers categorical data, the proposed methods are introduced in the general context of mixed data (data composed of different types of features). As a first step, the value of both proposed methods is demonstrated on simulated data and on several real benchmark data sets. We then apply the clustering method to the human population genomic data, which makes it possible to detect the most discriminative genetic markers. The proposed methods are implemented in the R package VarSelLCM, available on CRAN.
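To make concrete the BIC trade-off that the abstract describes, the following is a minimal Python sketch and not the VarSelLCM implementation. It assumes a latent class model on categorical variables in which a relevant variable has one distribution per component and an irrelevant variable has a single distribution shared by all components. It also assumes the convention BIC = log-likelihood minus half the number of free parameters times log(n). The function names `count_parameters` and `bic` are hypothetical, as are all numeric inputs.

```python
import math


def count_parameters(n_components, levels, relevant):
    """Count free parameters of a categorical latent class model.

    Assumption (illustrative): a relevant variable with m levels needs
    (m - 1) parameters per component; an irrelevant one needs (m - 1) in total.
    """
    if len(levels) != len(relevant):
        raise ValueError("levels and relevant must have the same length")
    # Mixing proportions sum to one, hence n_components - 1 free values.
    n_params = n_components - 1
    for n_levels, is_relevant in zip(levels, relevant):
        n_distributions = n_components if is_relevant else 1
        n_params += (n_levels - 1) * n_distributions
    return n_params


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'larger is better' convention assumed above."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)


# Hypothetical example: 3 components, three markers with 3, 3 and 2 levels,
# the second marker declared irrelevant for clustering.
n_selected = count_parameters(3, [3, 3, 2], [True, False, True])
n_full = count_parameters(3, [3, 3, 2], [True, True, True])
```

Under these assumptions, declaring a variable with m levels irrelevant removes (K - 1)(m - 1) parameters from the penalty for K components, so the variable is kept only if its contribution to the log-likelihood outweighs that cost.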
Pages: 124-142
Page count: 18