Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Cited by: 24

Authors:

Abuaiadah, Diab [1, 2]

Affiliations:

[1] Waikato Inst Technol, Hamilton, New Zealand

[2] Ctr Business Informat Technol & Enterprise, Waikato Mail Ctr, Ground Floor,E Block,Private Bag 3036, Hamilton 3240, New Zealand

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2016 / Vol. 15 / Issue 03

Keywords:

Information retrieval; K-means; bisect K-means; Arabic stemmers; similarity measures;

DOI:

10.1145/2812809

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this article, I have investigated the performance of the bisect K-means clustering algorithm compared to the standard K-means algorithm in the analysis of Arabic documents. The experiments included five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Measured by purity, the bisect K-means clearly outperformed the standard K-means in all settings, by varying margins. For the bisect K-means, the best purity reached 0.927 when using the Pearson correlation coefficient function, while for the standard K-means, the best purity reached 0.884 when using the Jaccard coefficient function. Removing stop words significantly improved the results of the bisect K-means but produced only minor improvements in the results of the standard K-means. Stemming provided an additional minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence function and the root-based stemmer, where purity deteriorated by more than 10%. These experiments were conducted using a dataset with nine categories, each of which contains 300 documents.
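
The abstract reports all results as purity scores. As a minimal sketch, assuming the standard textbook definition of purity (each cluster is credited with its most frequent true category, and the credited counts are divided by the total number of documents), the measure could be computed as below. The function name `purity` and its arguments are illustrative and do not come from the article, which may define the measure differently.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Return the purity of a clustering, assuming the standard definition.

    cluster_ids: cluster assignment for each document (hypothetical input).
    true_labels: gold category for each document, in the same order.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not true_labels:
        raise ValueError("at least one document is required")

    # Count how often each true category occurs inside each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Credit each cluster with the size of its majority category.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(true_labels)
```

Under this definition a score of 1.0 means every cluster contains documents from a single category, so the reported 0.927 would correspond to roughly 93% of documents sitting in their cluster's majority category.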


Pages: 13
