Tree-based text stream clustering with application to spam mail classification

Cited by: 4
Authors
Taninpong, Phimphaka [1]
Ngamsuriyaroj, Sudsanguan [1]
Affiliations
[1] Mahidol Univ, Fac Informat & Commun Technol, Salaya 73170, Nakhon Pathom, Thailand
Keywords
clustering; data mining; text clustering; text mining; text stream; tree-based clustering; spam; spam classification; text classification
DOI
10.1504/IJDMMM.2018.095354
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper proposes a new text clustering algorithm based on a tree structure. The main idea is that the sub-tree rooted at a specific node represents a document cluster. The algorithm makes a single pass over the data, traversing down the tree to find all clusters without requiring the number of clusters to be predefined. It therefore meets our objectives of producing highly cohesive document clusters while keeping the number of clusters to a minimum. Moreover, an incremental learning process is performed after each new document is inserted into the tree, and the clusters are rebuilt to accommodate the new information. In addition, we applied the proposed clustering algorithm to spam mail classification, and the experimental results show that the tree-based text clustering spam filter gives higher accuracy and specificity than Cobweb clustering, naive Bayes, and KNN.
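The abstract reports results in terms of accuracy and specificity. As a reminder of what those two measures compute, here is a minimal sketch using their standard confusion-matrix definitions. It assumes spam is treated as the positive class (the record does not state which convention the paper uses), and the function names and counts are purely illustrative, not taken from the paper.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all messages that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def specificity(tn: int, fp: int) -> float:
    """Fraction of negative-class messages correctly left unflagged.

    Assumption: spam is the positive class, so this is the share of
    legitimate mail that was not marked as spam.
    """
    negatives = tn + fp
    return tn / negatives if negatives else 0.0


# Illustrative counts only (not results from the paper).
tp, tn, fp, fn = 40, 50, 5, 5
acc = accuracy(tp, tn, fp, fn)
spec = specificity(tn, fp)
```

Under this convention, a filter with high specificity rarely sends legitimate mail to the spam folder, which is usually the more costly kind of error.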
Pages: 353-370
Number of pages: 18