Short text clustering based on Pitman-Yor process mixture model

Cited by: 29
Authors
Qiang, Jipeng [1]
Li, Yun [1]
Yuan, Yunhao [1]
Wu, Xindong [2, 3]
Affiliations
[1] Yangzhou Univ, Dept Comp Sci, Yangzhou, Jiangsu, Peoples R China
[2] Hefei Univ Technol, Dept Comp Sci, Hefei, Anhui, Peoples R China
[3] Univ Louisiana Lafayette, Sch Comp & Informat, Lafayette, LA 70504 USA
Funding
National Natural Science Foundation of China
Keywords
LDA; Pitman-Yor process; Short text clustering; Nonnegative matrix factorization; Algorithms
DOI
10.1007/s10489-017-1055-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
To find the appropriate number of clusters in short text clustering, models based on the Dirichlet Multinomial Mixture (DMM) require the maximum possible number of clusters to be specified before the actual number can be inferred. However, choosing a proper value is difficult because the true number of clusters in short texts is not known beforehand. Moreover, the cluster distribution in DMM with a Dirichlet process prior decays exponentially as the number of clusters increases. Therefore, in this paper we propose a novel model based on the Pitman-Yor process to capture the power-law phenomenon of the cluster distribution. Specifically, each text chooses one of the active clusters or a new cluster with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Discriminative and nondiscriminative words are identified automatically to help enhance text clustering. Parameters are estimated efficiently by collapsed Gibbs sampling, and experimental results show that PYPM is robust and effective compared with state-of-the-art models.
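The cluster-choice step described in the abstract can be illustrated with the standard two-parameter Chinese restaurant construction of the Pitman-Yor process. The sketch below is a minimal illustration, not the authors' sampler: it covers only the prior term, whereas the collapsed Gibbs sampler in the paper would also weight each option by how well the text's words fit the cluster. The function names and the parameter names `theta` (concentration) and `d` (discount) are assumptions introduced here.

```python
import random


def pyp_prior_probs(cluster_sizes, theta=1.0, d=0.5):
    """Prior probabilities under a Pitman-Yor process.

    Returns (existing, new): the probability of joining each active
    cluster and the probability of opening a new one.
    theta and d are assumed names for concentration and discount.
    """
    if not (0 <= d < 1) or theta <= -d:
        raise ValueError("need 0 <= d < 1 and theta > -d")
    n = sum(cluster_sizes)   # texts assigned so far
    k = len(cluster_sizes)   # number of active clusters
    if n == 0:
        return [], 1.0       # the first text always opens a cluster
    denom = n + theta
    existing = [(size - d) / denom for size in cluster_sizes]
    new = (theta + d * k) / denom
    return existing, new


def simulate_cluster_sizes(num_texts, theta=1.0, d=0.5, seed=0):
    """Assign texts one by one using only the prior (no likelihood term)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_texts):
        existing, new = pyp_prior_probs(sizes, theta, d)
        weights = existing + [new]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[choice] += 1   # join an active cluster
    return sizes


if __name__ == "__main__":
    existing, new = pyp_prior_probs([3, 1], theta=1.0, d=0.5)
    print(existing, new)
    print(sorted(simulate_cluster_sizes(500), reverse=True)[:10])
```

Setting `d = 0` reduces the rule to the Dirichlet process case, where the new-cluster probability is `theta / (n + theta)` regardless of how many clusters are active. With `d > 0` that probability grows with the number of active clusters, which is what gives the Pitman-Yor process its power-law behaviour in cluster sizes.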
Pages: 1802 - 1812
Page count: 11
Related papers
50 in total
  • [41] A Dirichlet process biterm-based mixture model for short text stream clustering
    Junyang Chen
    Zhiguo Gong
    Weiwen Liu
    Applied Intelligence, 2020, 50 : 1609 - 1619
  • [42] Truncated two-parameter Poisson-Dirichlet approximation for Pitman-Yor process hierarchical models
    Zhang, Junyi
    Dassios, Angelos
    SCANDINAVIAN JOURNAL OF STATISTICS, 2024, 51 (02) : 590 - 611
  • [43] A Novel 3D Model Recognition Approach using Pitman-Yor Process Mixtures of Beta-Liouville Distributions
    Fan, Wentao
    Al-Osaimi, Faisal R.
    Bouguila, Nizar
    2016 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2016 : 1986 - 1989
  • [44] WORD SEGMENTATION FROM PHONEME SEQUENCES BASED ON PITMAN-YOR SEMI-MARKOV MODEL EXPLOITING SUBWORD INFORMATION
    Takeda, Ryu
    Komatani, Kazunori
    Rudnicky, Alexander I.
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018 : 763 - 770
  • [45] Hierarchical Dirichlet and Pitman-Yor process mixtures of shifted-scaled Dirichlet distributions for proportional data modeling
    Baghdadi, Ali
    Manouchehri, Narges
    Patterson, Zachary
    Fan, Wentao
    Bouguila, Nizar
    COMPUTATIONAL INTELLIGENCE, 2022, 38 (06) : 2095 - 2115
  • [46] Morpheme Level Hierarchical Pitman-Yor Class-based Language Models for LVCSR of Morphologically Rich Languages
    Mousa, Amr El-Desoky
    Shaik, M. Ali Basha
    Schlueter, Ralf
    Ney, Hermann
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013 : 3376 - 3380
  • [47] A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering
    Yin, Jianhua
    Wang, Jianyong
    PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'14), 2014 : 233 - 242
  • [48] Adaptive Bayesian Density Estimation in Lp-metrics with Pitman-Yor or Normalized Inverse-Gaussian Process Kernel Mixtures
    Scricciolo, Catia
    BAYESIAN ANALYSIS, 2014, 9 (02) : 475 - 520
  • [49] An Adaptive Dirichlet Multinomial Mixture Model for Short Text Streaming Clustering
    Duan, Ruting
    Li, Chunping
    2018 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE (WI 2018), 2018 : 49 - 55
  • [50] Model-based Clustering of Short Text Streams
    Yin, Jianhua
    Chao, Daren
    Liu, Zhongkun
    Zhang, Wei
    Yu, Xiaohui
    Wang, Jianyong
    KDD'18: PROCEEDINGS OF THE 24TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2018 : 2634 - 2642