Document clustering using locality preserving indexing

被引:564
作者
Cai, D
He, XF
Han, JW
机构
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[2] Univ Chicago, Dept Comp Sci, Chicago, IL 60637 USA
关键词
document clustering; locality preserving indexing; dimensionality reduction; semantics;
D O I
10.1109/TKDE.2005.198
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We propose a novel document clustering method which aims to cluster the documents into different semantic classes. The document space is generally of high dimensionality and clustering in such a high dimensional space is often infeasible due to the curse of dimensionality. By using Locality Preserving Indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which the documents related to the same semantics are close to each other. Different from previous document clustering methods based on Latent Semantic Indexing (LSI) or Nonnegative Matrix Factorization (NMF), our method tries to discover both the geometric and discriminating structures of the document space. Theoretical analysis of our method shows that LPI is an unsupervised approximation of the supervised Linear Discriminant Analysis (LDA) method, which gives the intuitive motivation of our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.
引用
收藏
页码:1624 / 1637
页数:14
相关论文
共 31 条
  • [11] A min-max cut algorithm for graph partitioning and data clustering
    Ding, CHQ
    He, XF
    Zha, HY
    Gu, M
    Simon, HD
    [J]. 2001 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2001, : 107 - 114
  • [12] A similarity-based probability model for latent semantic indexing
    Ding, CHQ
    [J]. SIGIR'99: PROCEEDINGS OF 22ND INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 1999, : 58 - 65
  • [13] Duda R. O., 2000, PATTERN CLASSIFICATI
  • [14] Ester M., 1996, Proc. Second Int. Conf. Knowl. Discov. Data Min, P226, DOI DOI 10.5555/3001460.3001507
  • [15] FUNKUNAGA K, 1975, IEEE T COMPUT, V24, P750
  • [16] Golub G.H., 2013, Matrix Computations, V4th
  • [17] He X., 2003, Advances in neural information processing systems, P153
  • [18] HE X, 2004, P 27 ANN INT ACM SIG, P96
  • [19] Routing and wavelength assignment in GMPLS networks
    Hua, Y
    Xu, W
    Wu, CL
    [J]. PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES, PDCAT'2003, PROCEEDINGS, 2003, : 268 - 271
  • [20] Jain A.K., 1988, Algorithms for Clustering Data