A Novel Stream Clustering Framework for Spam Detection in Twitter

Cited by: 29

Authors:

Tajalizadeh, Hadi [1]

Boostani, Reza [2]

Affiliations:

[1] Shiraz Univ, Comp Sci & Engn Dept, Artificial Intelligence Grp, Elect & Comp Engn Fac, Shiraz 7134851154, Iran

[2] Shiraz Univ, Comp Sci & Engn Dept, Biomed Engn Grp, Elect & Comp Engn Fac, Shiraz 7134851154, Iran

Source:

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS | 2019 / Vol. 6 / Issue 03

Keywords:

Clustering classification; DenStream; incremental Naive Bayes (INB); spam detection; stream clustering; EMPIRICAL-EVALUATION; DESIGN;

DOI:

10.1109/TCSS.2019.2910818

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Stream clustering methods have been used repeatedly for spam filtering to categorize incoming messages/tweets into spam and nonspam clusters. These methods assume that each cluster consists of a number of neighboring small clusters (microclusters), each with a symmetric distribution. This assumption does not necessarily hold, however, and large microclusters may have an asymmetric distribution. To improve the assignment accuracy of earlier methods in their online phase, we suggest replacing the Euclidean distance with a set of classifiers that assign incoming samples to the most relevant microcluster, whatever its distribution. Here, a set of incremental Naive Bayes (INB) classifiers is trained for the microclusters whose population exceeds a threshold. These INBs can capture both the mean and the boundary of a microcluster, whereas the Euclidean distance considers only the mean and is therefore inaccurate for large asymmetric microclusters. In this paper, DenStream is extended with the proposed framework, and the result is called INB-DenStream. To show the effectiveness of INB-DenStream, state-of-the-art methods such as DenStream, StreamKM++, and CluStream were applied to the Twitter datasets, and their performance was measured in terms of purity, general precision, general recall, F1 measure, parameter sensitivity, and computational complexity. The comparison indicated that our method outperforms its rivals on almost all of the datasets.
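The contrast the abstract draws between Euclidean assignment and classifier-based assignment can be illustrated with a small sketch. This is not the paper's implementation: the class `IncrementalGaussianNB`, the two helper functions, the toy data, and the choice of a Gaussian model with equal priors across microclusters are all assumptions made for illustration. Each microcluster keeps a running per-feature mean and variance, so it captures spread as well as center.

```python
import math


class IncrementalGaussianNB:
    """Hypothetical per-microcluster model: running mean/variance per feature."""

    def __init__(self, dim, var_floor=1e-6):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations (Welford)
        self.var_floor = var_floor

    def update(self, x):
        """Absorb one sample without storing it (one incremental step)."""
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def log_likelihood(self, x):
        """Naive Bayes assumption: features are independent Gaussians."""
        ll = 0.0
        for i, xi in enumerate(x):
            var = max(self.m2[i] / self.n, self.var_floor)
            ll += -0.5 * math.log(2 * math.pi * var)
            ll -= (xi - self.mean[i]) ** 2 / (2 * var)
        return ll


def nearest_by_euclidean(models, x):
    """Baseline: pick the microcluster whose mean is closest to x."""
    return min(models, key=lambda k: math.dist(models[k].mean, x))


def nearest_by_likelihood(models, x):
    """Sketch of classifier-based assignment (equal priors assumed)."""
    return max(models, key=lambda k: models[k].log_likelihood(x))


# Toy data (made up): A is large and elongated along x, B is small and compact.
samples = {
    "A": [(-4, 0.1), (-2, -0.1), (0, 0.1), (2, -0.1), (4, 0.0)],
    "B": [(6, 1.0), (6.2, 1.2), (5.8, 0.8), (6, 1.2), (6, 0.8)],
}
models = {}
for name, points in samples.items():
    models[name] = IncrementalGaussianNB(dim=2)
    for p in points:
        models[name].update(p)

query = (3.5, 0.0)  # lies inside A's extent but nearer to B's mean
euclid_choice = nearest_by_euclidean(models, query)
nb_choice = nearest_by_likelihood(models, query)
print(euclid_choice, nb_choice)
```

In this toy setup the query point falls within the elongated microcluster's extent, yet its mean is farther away than the compact microcluster's mean, so the two rules disagree. The abstract additionally restricts INB training to microclusters whose population exceeds a threshold; that threshold logic is omitted here.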


Pages: 525-534

Page count: 10

