A countably infinite mixture model for clustering and feature selection

Cited by: 0
Authors
Nizar Bouguila
Djemel Ziou
Affiliations
[1] CIISE
[2] Concordia University
[3] Université de Sherbrooke
Source
Knowledge and Information Systems | 2012 / Vol. 33
Keywords
Non-parametric Bayesian methods; Dirichlet process; Clustering; Feature selection; Mixture models; Generalized Dirichlet; MCMC; Categorization
DOI
Not available
Abstract
Mixture modeling is one of the most useful tools in machine learning and data mining applications. An important challenge when applying finite mixture models is selecting the number of clusters that best describes the data. Recent developments have shown that this problem can be handled by applying non-parametric Bayesian techniques to mixture modeling. Another crucial preprocessing step in mixture learning is the selection of the most relevant features. The main approach in this paper to tackling these problems consists of storing the knowledge in a generalized Dirichlet mixture model by applying non-parametric Bayesian estimation and inference techniques. Specifically, we extend finite generalized Dirichlet mixture models to the infinite case, in which the number of components and the relevant features do not need to be known a priori. This extension provides a natural representation of uncertainty regarding the challenging problem of model selection. We propose a Markov chain Monte Carlo algorithm to learn the resulting infinite mixture. Through applications involving text and image categorization, we show that infinite mixture models offer more powerful and robust performance than classic finite mixtures for both clustering and feature selection.
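To illustrate how an infinite mixture avoids fixing the number of components in advance, the following is a minimal sketch of the Chinese restaurant process, a standard way to draw cluster assignments from a Dirichlet process prior. It is not the paper's generalized Dirichlet sampler or its feature selection scheme; the function name `crp_assignments` and the concentration parameter `alpha` are assumptions introduced here for illustration only.

```python
import random


def crp_assignments(n_points, alpha, seed=0):
    """Draw cluster labels from a Chinese restaurant process prior.

    Hypothetical helper for illustration; alpha > 0 is the concentration
    parameter (larger values tend to open more clusters).
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of points currently in cluster k
    labels = []
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n_points=200, alpha=1.0)
    print(len(counts), "clusters; sizes:", counts)
```

The number of clusters is an outcome of the sampling rather than an input, which is the property the abstract relies on. A full sampler would additionally weight each choice by the likelihood of the data point under each component.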
Pages: 351-370
Number of pages: 19