Arabic spam tweets classification using deep learning

Cited by: 11
Authors
Kaddoura, Sanaa [1]
Alex, Suja A. [2]
Itani, Maher [3]
Henno, Safaa [1]
AlNashash, Asma [4]
Hemanth, D. Jude [5]
Affiliations
[1] Zayed Univ, Coll Technol Innovat, Dept Comp & Appl Technol, Abu Dhabi, U Arab Emirates
[2] St Xaviers Catholic Coll Engn, Dept Informat Technol, Nagercoil, India
[3] Comp Dept, Sabis Educ Serv, Acad Dev Div, Choueifat, Lebanon
[4] Princess Sumaya Univ Technol, King Hussein Sch Comp Sci, Dept Data Sci, Amman, Jordan
[5] Karunya Inst Technol & Sci, Dept ECE, Coimbatore, India
Keywords
Spam; Tweets; Machine learning; Deep learning; Classification
DOI
10.1007/s00521-023-08614-w
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
With the increased use of social network sites such as Twitter, attackers exploit these platforms to spread counterfeit content, such as fake advertisements or illegal content. Classifying such content is a challenging task, especially in Arabic, whose complex structure makes classification more difficult. This paper presents an approach to classifying Arabic tweets using classical (non-deep) machine learning and deep learning techniques. A corpus of tweets was collected through the Twitter API and labelled manually to obtain a reliable dataset. To build an efficient classifier, feature extraction is applied to the corpus. Then, two learning techniques are used for each feature extraction technique on the created dataset using N-gram models (uni-gram, bi-gram, and char-gram). The classical machine learning algorithms applied are support vector machines, neural networks, logistic regression, and naive Bayes. Global Vectors (GloVe) and fastText learning models are used for the deep learning approaches. Precision, Recall, and F1-score are the performance measures calculated in this paper. Afterwards, the dataset is enlarged using the synthetic minority oversampling technique (SMOTE) to create a balanced dataset. Among the classical machine learning models, the experimental results show that the neural network algorithm outperforms the other algorithms. Moreover, GloVe outperforms fastText for the deep learning approach.
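As an illustrative sketch only (not the authors' implementation), the N-gram feature types and the evaluation measures named in the abstract can be written in plain Python. The function names, the sample tweet, and the label vectors below are assumptions made for illustration; they do not come from the paper or its dataset.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word-level n-grams (n=1 gives uni-grams, n=2 gives bi-grams)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character-level n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute Precision, Recall, and F1-score for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical tweet, invented for illustration ("win a free prize now").
tweet = "اربح جائزة مجانية الآن"
unigrams = word_ngrams(tweet, 1)
bigrams = word_ngrams(tweet, 2)
chargrams = char_ngrams(tweet, 3)

# A simple bag-of-n-grams count vector over all three feature types.
feature_counts = Counter(unigrams + bigrams + chargrams)

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In practice, feature counts like these would be fed to a classifier such as logistic regression or naive Bayes; the sketch stops at feature extraction and scoring because the abstract does not specify model settings.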
Pages: 17233-17246
Page count: 14