Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Cited by: 24
Authors
Abuaiadah, Diab [1, 2]
Affiliations
[1] Waikato Inst Technol, Hamilton, New Zealand
[2] Ctr Business Informat Technol & Enterprise, Waikato Mail Ctr, Ground Floor,E Block,Private Bag 3036, Hamilton 3240, New Zealand
Keywords
Information retrieval; K-means; bisect K-means; Arabic stemmers; similarity measures;
DOI
10.1145/2812809
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this article, I investigate the performance of the bisect K-means clustering algorithm relative to the standard K-means algorithm in the analysis of Arabic documents. The experiments included five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Measured by purity, the bisect K-means outperformed the standard K-means in all settings, by varying margins. For the bisect K-means, the best purity was 0.927, obtained with the Pearson correlation coefficient function, while for the standard K-means, the best purity was 0.884, obtained with the Jaccard coefficient function. Removing stop words significantly improved the results of the bisect K-means but produced only minor improvements in the results of the standard K-means. Stemming provided an additional minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence function and the root-based stemmer, where purity deteriorated by more than 10%. These experiments were conducted using a dataset with nine categories, each of which contains 300 documents.
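The abstract reports all results as purity scores. As a minimal sketch, assuming the standard definition of purity (each cluster is credited with its most frequent true category, and the credited documents are divided by the total number of documents), the measure can be computed as follows. The function name `purity` and the toy labels are illustrative assumptions and are not taken from the article.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Return the purity of a clustering, assuming the standard definition.

    cluster_ids  -- cluster assigned to each document (hypothetical input)
    class_labels -- true category of each document, in the same order
    """
    if len(cluster_ids) != len(class_labels) or not class_labels:
        raise ValueError("inputs must be non-empty and of equal length")

    # Count how many documents of each true category land in each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Credit each cluster with the size of its majority category.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(class_labels)


# Toy example: six documents, two clusters, three true categories.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["sport", "sport", "economy", "economy", "economy", "health"]
toy_score = purity(toy_clusters, toy_labels)
```

In the toy example each cluster has two documents from its majority category, so four of the six documents are credited. A purity of 1.0 would mean that every cluster contains documents from a single category.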
Pages: 13