A modified bond energy algorithm with fuzzy merging and its application to Arabic text document clustering

Cited by: 15
Authors
AlMahmoud, Rana Husni [1]
Hammo, Bassam [1]
Faris, Hossam [1]
Affiliations
[1] Univ Jordan, King Abdullah II Sch Informat Technol, Amman, Jordan
Keywords
Bond energy algorithm; Arabic text document clustering; Fuzzy Merging; FEATURE-SELECTION; NUMBER;
DOI
10.1016/j.eswa.2020.113598
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Conventional text document clustering algorithms suffer from several shortcomings, such as slow convergence on very large, high-dimensional data, sensitivity to initial values, and poor interpretability of the resulting cluster descriptions. Although many clustering algorithms have been developed for English and other languages, very few have tackled the problem of clustering documents in the under-resourced Arabic language. In this work, we propose a modified version of the Bond Energy Algorithm (BEA) combined with a fuzzy merging technique to solve the problem of Arabic text document clustering. The proposed algorithm, Clustering Arabic Documents based on Bond Energy, hereafter named CADBE, attempts to identify and display natural variable clusters within very large data sets. CADBE clusters Arabic documents in three steps: the first step instantiates a cluster affinity matrix using the BEA, the second step uses a novel method to partition the cluster matrix automatically into small coherent clusters, and the last step uses a fuzzy merging technique to merge similar clusters based on the associations and interrelations between the resulting clusters. Experimental results showed that the proposed algorithm outperformed conventional clustering algorithms such as Expectation-Maximization (EM), Single Linkage, and UPGMA in terms of clustering purity and entropy. It also outperformed k-means, k-means++, spherical k-means, and CoclusMod in most test cases. CADBE also has several merits. First, unlike traditional clustering algorithms, it does not require the number of clusters to be specified. In addition, it produces clusters with distinct boundaries, which makes its results more objective. Finally, it is deterministic, in that it is insensitive to the order in which documents are presented to the algorithm. © 2020 Elsevier Ltd. All rights reserved.
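The first step named in the abstract builds on the Bond Energy Algorithm. As a rough illustration only, the following is a minimal sketch of the classic, textbook BEA column-ordering step, not the modified version proposed in the paper. The abstract gives no detail on the modification, the automatic partitioning, or the fuzzy merging, so none of those are attempted here. The function names `bond` and `bea_order` and the toy affinity matrix are assumptions introduced for this example.

```python
import numpy as np


def bond(A, x, y):
    """Bond between columns x and y of A; None stands for an empty boundary column."""
    if x is None or y is None:
        return 0.0
    return float(A[:, x] @ A[:, y])


def bea_order(A):
    """Return a column ordering of the square affinity matrix A (classic BEA).

    Columns are placed one at a time at the position that maximizes the
    net gain in bond energy with the neighbouring columns.
    """
    n = A.shape[0]
    order = list(range(min(n, 2)))  # seed with the first two columns
    for k in range(2, n):
        best_pos, best_gain = 0, -np.inf
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            # Gain of inserting k between left and right.
            gain = (2 * bond(A, left, k)
                    + 2 * bond(A, k, right)
                    - 2 * bond(A, left, right))
            if gain > best_gain:  # strict '>' keeps the first best position
                best_pos, best_gain = pos, gain
        order.insert(best_pos, k)
    return order


# Hypothetical affinity matrix: items 0 and 2 are related, as are items 1 and 3.
A = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=float)

order = bea_order(A)
reordered = A[np.ix_(order, order)]  # related items now sit in adjacent rows/columns
print(order)
print(reordered)
```

In this sketch, ties between insertion positions are broken by taking the first one, so the ordering depends only on the affinity values and the fixed scan order. Reordering rows and columns by `order` moves related items next to each other, which is the block structure that a later partitioning step would cut into clusters.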
Pages: 24