A modified bond energy algorithm with fuzzy merging and its application to Arabic text document clustering

Cited by: 15
Authors
AlMahmoud, Rana Husni [1]
Hammo, Bassam [1]
Faris, Hossam [1]
Affiliations
[1] Univ Jordan, King Abdullah II Sch Informat Technol, Amman, Jordan
Keywords
Bond energy algorithm; Arabic text document clustering; Fuzzy Merging; FEATURE-SELECTION; NUMBER;
DOI
10.1016/j.eswa.2020.113598
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Conventional text document clustering algorithms suffer from several shortcomings, such as slow convergence on very large, high-dimensional data, sensitivity to initial values, and poor understandability of the descriptions of the resulting clusters. Although many clustering algorithms have been developed for English and other languages, very few have tackled the problem of clustering documents in the under-resourced Arabic language. In this work, we propose a modified version of the Bond Energy Algorithm (BEA) combined with a fuzzy merging technique to solve the problem of Arabic text document clustering. The proposed algorithm, Clustering Arabic Documents based on Bond Energy, hereafter named CADBE, attempts to identify and display natural variable clusters within very large data sets. CADBE clusters Arabic documents in three steps: the first step instantiates a cluster affinity matrix using the BEA, the second step uses a novel method to partition the cluster matrix automatically into small coherent clusters, and the last step uses a fuzzy merging technique to merge similar clusters based on the associations and interrelations between the resulting clusters. Experimental results showed that the proposed algorithm outperformed conventional clustering algorithms such as Expectation-Maximization (EM), Single Linkage, and UPGMA in terms of clustering purity and entropy. It also outperformed k-means, k-means++, spherical k-means, and CoclusMod in most test cases. CADBE also has several merits. First, unlike traditional clustering algorithms, it does not require the number of clusters to be specified. In addition, it produces clusters with distinct boundaries, which makes its results more objective. Finally, it is deterministic, such that it is insensitive to the order in which documents are presented to the algorithm. © 2020 Elsevier Ltd. All rights reserved.
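The abstract reports results in terms of clustering purity and entropy. As a minimal sketch of how these two measures are commonly computed (not the authors' evaluation code), the Python below assumes one widespread definition: purity is the fraction of documents that fall in the majority class of their cluster, and entropy is the size-weighted average of each cluster's class entropy, taken with log base 2 and left unnormalized. The function names and the toy labels are hypothetical, and the paper may use a different normalization.

```python
from collections import Counter
from math import log2


def _class_counts_per_cluster(labels_true, labels_pred):
    """Group documents by predicted cluster and count true classes in each."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    return clusters


def purity(labels_true, labels_pred):
    """Fraction of documents belonging to the majority class of their cluster."""
    clusters = _class_counts_per_cluster(labels_true, labels_pred)
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted average class entropy over clusters (log base 2, unnormalized)."""
    clusters = _class_counts_per_cluster(labels_true, labels_pred)
    n = len(labels_true)
    total = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        cluster_entropy = -sum((c / size) * log2(c / size) for c in counts.values())
        total += (size / n) * cluster_entropy
    return total


if __name__ == "__main__":
    # Hypothetical toy data: six documents, three true classes, three clusters.
    y_true = ["sport", "sport", "politics", "politics", "sport", "economy"]
    y_pred = [0, 0, 1, 1, 1, 2]
    print("purity:", purity(y_true, y_pred))
    print("entropy:", entropy(y_true, y_pred))
```

Under these definitions, higher purity and lower entropy both indicate clusters that align more closely with the reference classes, which is the sense in which the abstract compares CADBE with the baseline algorithms.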
Pages: 24