Short text clustering based on Pitman-Yor process mixture model

Cited by: 29

Authors:

Qiang, Jipeng [1]

Li, Yun [1]

Yuan, Yunhao [1]

Wu, Xindong [2,3]

Affiliations:

[1] Yangzhou Univ, Dept Comp Sci, Yangzhou, Jiangsu, Peoples R China

[2] Hefei Univ Technol, Dept Comp Sci, Hefei, Anhui, Peoples R China

[3] Univ Louisiana Lafayette, Sch Comp & Informat, Lafayette, LA 70504 USA

Source:

APPLIED INTELLIGENCE | 2018 / Volume 48 / Issue 07

Funding:

National Natural Science Foundation of China;

Keywords:

LDA; Pitman-Yor process; Short text clustering; NONNEGATIVE MATRIX FACTORIZATION; ALGORITHMS;

DOI:

10.1007/s10489-017-1055-4

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

To find the appropriate number of clusters in short text clustering, models based on the Dirichlet Multinomial Mixture (DMM) require the maximum possible number of clusters to be specified before the actual number is inferred. However, choosing this maximum is difficult because the true number of clusters in short texts is not known beforehand. Moreover, the cluster distribution in DMM, which uses a Dirichlet process as its prior, decays exponentially as the number of clusters increases. We therefore propose a novel model based on the Pitman-Yor process to capture the power-law behavior of the cluster distribution. Specifically, each text chooses either one of the active clusters or a new cluster, with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Discriminative and nondiscriminative words are identified automatically to improve clustering. Parameters are estimated efficiently by collapsed Gibbs sampling, and experimental results show that PYPM is robust and effective compared with state-of-the-art models.
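The choice between an active cluster and a new cluster can be illustrated with the standard Pitman-Yor (two-parameter Chinese restaurant process) prior. The sketch below is only that prior, not the paper's PYPM: it leaves out the word likelihood, the handling of discriminative words, and the collapsed Gibbs sampler. The function names and the parameter names `discount` and `concentration` are assumptions made for illustration and do not come from the record above.

```python
import random


def pitman_yor_prior(cluster_sizes, discount, concentration):
    """Prior probabilities of joining each active cluster or opening a new one.

    Standard Pitman-Yor predictive rule (an assumption of this sketch):
      existing cluster k: (n_k - discount) / (n + concentration)
      new cluster:        (concentration + K * discount) / (n + concentration)
    where n is the number of texts already assigned and K the number of
    active clusters. The last entry of the result is the new-cluster case.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")
    n = sum(cluster_sizes)
    if n == 0:
        # The first text always opens the first cluster.
        return [1.0]
    k = len(cluster_sizes)
    denom = n + concentration
    probs = [(size - discount) / denom for size in cluster_sizes]
    probs.append((concentration + k * discount) / denom)
    return probs


def sample_partition(num_texts, discount, concentration, seed=0):
    """Assign texts one by one using the prior alone; return cluster sizes."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_texts):
        probs = pitman_yor_prior(sizes, discount, concentration)
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        if choice == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[choice] += 1   # join an active cluster
    return sizes


if __name__ == "__main__":
    # With discount = 0 the rule reduces to the Dirichlet process prior.
    print(pitman_yor_prior([3, 1], discount=0.0, concentration=1.0))
    print(pitman_yor_prior([3, 1], discount=0.5, concentration=1.0))
    print(sorted(sample_partition(1000, 0.5, 1.0), reverse=True)[:10])
```

With `discount` set to 0 the rule reduces to the Dirichlet process case. A positive discount moves probability mass from active clusters to the new-cluster option, and that option's mass grows with the number of active clusters, which is the mechanism usually credited for power-law cluster sizes.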


Pages: 1802-1812

Page count: 11

