Short text clustering based on Pitman-Yor process mixture model

Cited: 29
Authors
Qiang, Jipeng [1 ]
Li, Yun [1 ]
Yuan, Yunhao [1 ]
Wu, Xindong [2 ,3 ]
Affiliations
[1] Yangzhou Univ, Dept Comp Sci, Yangzhou, Jiangsu, Peoples R China
[2] Hefei Univ Technol, Dept Comp Sci, Hefei, Anhui, Peoples R China
[3] Univ Louisiana Lafayette, Sch Comp & Informat, Lafayette, LA 70504 USA
Funding
National Natural Science Foundation of China;
Keywords
LDA; Pitman-Yor process; Short text clustering; Nonnegative matrix factorization; Algorithms;
DOI
10.1007/s10489-017-1055-4
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
To find the appropriate number of clusters in short text clustering, models based on the Dirichlet Multinomial Mixture (DMM) require a maximum possible cluster number before inferring the real number of clusters. However, it is difficult to choose a proper number because the true number of clusters in short texts is not known beforehand. Moreover, in DMM with a Dirichlet process prior, the cluster-size distribution falls off exponentially as the number of clusters increases. In this paper, we therefore propose a novel model based on the Pitman-Yor process to capture the power-law behavior of the cluster-size distribution. Specifically, each text chooses one of the active clusters or a new cluster with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Discriminative and nondiscriminative words are identified automatically to help enhance text clustering. Parameters are estimated efficiently by collapsed Gibbs sampling, and experimental results show that PYPM is robust and effective compared with state-of-the-art models.
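The cluster-assignment step the abstract describes can be illustrated with a minimal sketch of the Pitman-Yor Chinese-restaurant scheme. This is a sketch only: the function name pyp_assign and the values of the concentration theta and discount d are illustrative, and the paper's full collapsed Gibbs sampler additionally weights each choice by the DMM word likelihood of the text under that cluster.

```python
import random

def pyp_assign(cluster_sizes, theta=1.0, d=0.5):
    """Sample a cluster index for the next text under a Pitman-Yor
    Chinese-restaurant scheme (prior part only; illustrative values
    of theta and d, not the paper's settings).

    cluster_sizes[k] is the number of texts already in cluster k;
    returning len(cluster_sizes) means "open a new cluster".
    """
    n = sum(cluster_sizes)   # texts assigned so far
    K = len(cluster_sizes)   # active clusters
    # Existing cluster k is chosen with weight (n_k - d); a new
    # cluster with weight (theta + d * K). Because the new-cluster
    # weight grows with K, the resulting cluster-size distribution
    # follows the power law the abstract refers to, rather than the
    # exponential decay of the Dirichlet process (d = 0).
    weights = [n_k - d for n_k in cluster_sizes] + [theta + d * K]
    r = random.uniform(0.0, sum(weights))
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r <= acc:
            return k
    return K  # numerical fallback: open a new cluster
```

Setting the discount d to 0 recovers the Dirichlet-process behavior, which is why the Pitman-Yor prior is the natural generalization when a heavy-tailed cluster-size distribution is expected.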
Pages: 1802-1812
Page count: 11
Related Papers
50 entries in total
  • [31] 3D Object Modeling and Recognition via Online Hierarchical Pitman-Yor Process Mixture Learning
    Fan, Wentao
    Al-Osaimi, Faisal R.
    Bouguila, Nizar
    Du, Ji-Xiang
    2015 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP), 2015, : 448 - 452
  • [32] Genre-Based Music Language Modeling with Latent Hierarchical Pitman-Yor Process Allocation
    Raczynski, Stanislaw A.
    Vincent, Emmanuel
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2014, 22 (03) : 672 - 681
  • [34] Preserving Unique Structural Blocks of Targets in ISAR Imaging by Pitman-Yor Process
    Cheng, Di
    Yuan, Bo
    Dai, Yulong
    Chen, Chang
    Chen, Weidong
    IEEE SENSORS JOURNAL, 2021, 21 (02) : 1859 - 1876
  • [35] Learning Terrain Types with the Pitman-Yor Process Mixtures of Gaussians for a Legged Robot
    Dallaire, Patrick
    Walas, Krzysztof
    Giguere, Philippe
    Chaib-draa, Brahim
    2015 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2015, : 3457 - 3463
  • [36] Unsupervised Learning of Agglutinated Morphology using Nested Pitman-Yor Process based Morpheme Induction Algorithm
    Kumar, Arun
Padro, Lluis
    Oliver, Antoni
    PROCEEDINGS OF 2015 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, 2015, : 45 - 48
  • [37] Nonparametric Bayesian Methods and the Dependent Pitman-Yor Process for Modeling Evolution in Multiple Object Tracking
    Moraffah, Bahman
    Papandreou-Suppappola, Antonia
    Rangaswamy, Muralidhar
    2019 22ND INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION 2019), 2019,
  • [38] Online Learning of Concepts and Words Using Multimodal LDA and Hierarchical Pitman-Yor Language Model
    Araki, Takaya
    Nakamura, Tomoaki
    Nagai, Takayuki
    Nagasaka, Shogo
    Taniguchi, Tadahiro
    Iwahashi, Naoto
    2012 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2012, : 1623 - 1630
  • [39] Pitman-Yor process mixture model for community structure exploration considering latent interaction patterns
    Wang, Jing
    Li, Kan
    Chinese Physics B, 2021, 30 (12) : 276 - 288
  • [40] A Dirichlet process biterm-based mixture model for short text stream clustering
    Chen, Junyang
    Gong, Zhiguo
    Liu, Weiwen
    APPLIED INTELLIGENCE, 2020, 50 (05) : 1609 - 1619