A Novel Stream Clustering Framework for Spam Detection in Twitter

Cited by: 29
Authors
Tajalizadeh, Hadi [1]
Boostani, Reza [2]
Affiliations
[1] Shiraz Univ, Comp Sci & Engn Dept, Artificial Intelligence Grp, Elect & Comp Engn Fac, Shiraz 7134851154, Iran
[2] Shiraz Univ, Comp Sci & Engn Dept, Biomed Engn Grp, Elect & Comp Engn Fac, Shiraz 7134851154, Iran
Keywords
Clustering classification; DenStream; incremental Naive Bayes (INB); spam detection; stream clustering; EMPIRICAL-EVALUATION; DESIGN
DOI
10.1109/TCSS.2019.2910818
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Stream clustering methods have been repeatedly used for spam filtering to categorize incoming messages or tweets into spam and nonspam clusters. These methods assume that each cluster consists of a number of neighboring small clusters (microclusters), each with a symmetric distribution. This assumption does not necessarily hold, however, and large microclusters may have asymmetric distributions. To improve the assignment accuracy of earlier methods in their online phase, we suggest replacing the Euclidean distance with a set of classifiers that assign each incoming sample to the most relevant microcluster, whatever its distribution. Specifically, a set of incremental Naive Bayes (INB) classifiers is trained for the microclusters whose population exceeds a threshold. These INBs can capture both the mean and the boundary of a microcluster, whereas the Euclidean distance considers only the cluster mean and is inaccurate for large asymmetric microclusters. In this paper, DenStream is extended with the proposed framework, and the result is called INB-DenStream. To show the effectiveness of INB-DenStream, state-of-the-art methods such as DenStream, StreamKM++, and CluStream were applied to Twitter datasets, and their performance was measured in terms of purity, general precision, general recall, F1 measure, parameter sensitivity, and computational complexity. The comparison indicated that our method outperforms its rivals on almost all of the datasets.
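The assignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `MicroCluster` class, the `assign` function, the `min_pop` threshold, the Gaussian form of the Naive Bayes likelihood, and the rule of scoring only the large microclusters are all assumptions made for illustration. The toy data show a point that lies nearer to the mean of a tight microcluster but is far more plausible under a wide one, so the nearest-mean rule and the incremental Naive Bayes score disagree.

```python
import math


class MicroCluster:
    """Hypothetical microcluster keeping per-feature running statistics."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations (Welford)

    def update(self, x):
        # Incremental update: one sample at a time, no stored history.
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def euclidean(self, x):
        # Distance to the mean only; ignores the spread of the microcluster.
        return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, self.mean)))

    def log_posterior(self, x, total, eps=1e-9):
        # Gaussian Naive Bayes score: log prior plus per-feature log-likelihoods.
        score = math.log(self.n / total)
        for xi, mi, m2 in zip(x, self.mean, self.m2):
            var = m2 / self.n + eps  # eps guards against zero variance
            score += -0.5 * math.log(2 * math.pi * var) - (xi - mi) ** 2 / (2 * var)
        return score


def assign(x, clusters, min_pop=5):
    """Return the index of the chosen microcluster (assumed policy)."""
    large = [i for i, c in enumerate(clusters) if c.n >= min_pop]
    if large:
        total = sum(clusters[i].n for i in large)
        return max(large, key=lambda i: clusters[i].log_posterior(x, total))
    # Fallback when no microcluster is populous enough: nearest mean.
    return min(range(len(clusters)), key=lambda i: clusters[i].euclidean(x))


# Toy one-dimensional data: a wide microcluster and a tight one.
wide, tight = MicroCluster(1), MicroCluster(1)
for v in (-7.0, -5.0, -3.0, 0.0, 3.0, 5.0, 7.0):
    wide.update([v])
for v in (9.5, 9.75, 10.0, 10.25, 10.5):
    tight.update([v])

clusters = [wide, tight]
x = [6.0]
by_distance = min(range(len(clusters)), key=lambda i: clusters[i].euclidean(x))
by_inb = assign(x, clusters)
print(by_distance, by_inb)
```

Here the point 6.0 is 4 units from the tight microcluster's mean and 6 units from the wide one's, so the nearest-mean rule picks the tight microcluster, while the Naive Bayes score prefers the wide one because 6.0 lies well within its spread. A Gaussian likelihood is itself symmetric per feature, so this sketch only illustrates the use of spread in addition to the mean, not the handling of asymmetric shapes.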
Pages: 525-534
Page count: 10