High-dimensional count data clustering based on an exponential approximation to the multinomial Beta-Liouville distribution

Cited by: 11
Authors
Zamzami, Nuha [1, 2]
Bouguila, Nizar [1]
Affiliations
[1] Concordia Univ, Concordia Inst Informat Syst Engn CIISE, Montreal, PQ, Canada
[2] Univ Jeddah, Coll Comp Sci & Engn, Jeddah, Saudi Arabia
Keywords
Exponential family; Finite mixtures; Model selection; Count data; CEM; Probabilistic kernels; SHAPES
DOI
10.1016/j.ins.2020.03.028
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we propose a mixture model for high-dimensional count data clustering based on an exponential-family approximation of the Multinomial Beta-Liouville distribution, which we call EMBL. We deal simultaneously with the problems of fitting the model to observed data and selecting the number of components. The learning algorithm automatically selects the optimal number of components and avoids several drawbacks of the standard Expectation-Maximization algorithm, including sensitivity to initialization and possible convergence to the boundary of the parameter space. We demonstrate the effectiveness and robustness of the proposed clustering approach through extensive empirical experiments involving challenging real-world applications. The results show that the proposed model achieves higher accuracy than state-of-the-art generative models for count data clustering. Furthermore, the superior performance of EMBL demonstrates its flexibility and its ability to address the burstiness phenomenon, and shows its computational efficiency, especially when dealing with sparse high-dimensional vectors. © 2020 Elsevier Inc. All rights reserved.
Pages: 116-135
Page count: 20