Spam SMS filtering based on text features and supervised machine learning techniques

Cited by: 0
Authors
Muhammad Adeel Abid
Saleem Ullah
Muhammad Abubakar Siddique
Muhammad Faheem Mushtaq
Wajdi Aljedaani
Furqan Rustam
Affiliations
[1] Khwaja Fareed University of Engineering and Information Technology, Department of Software Engineering, School of Systems and Technology
[2] The Islamia University of Bahawalpur
[3] University of North Texas
[4] University of Management and Technology
Source
Multimedia Tools and Applications | 2022 / Vol. 81
Keywords
SMS; Spam; Supervised machine learning; TF-IDF; Bag of words; Classification
DOI
Not available
Abstract
Advances in technology have affected every field of life, including medicine, music, office work, travel, and communication. In the past, wired telephone lines were the main communication medium; today, wireless technology has overtaken wired telephony and offers much broader features. Advertising agencies and spammers commonly use SMS to push their promotional material to ordinary users. As a result, more than 60% of the SMS messages received daily are spam. These messages annoy users and sometimes defraud unsuspecting ones, while generating large profits for spammers and advertising companies. This study proposes an approach for classifying SMS messages as spam or ham using supervised machine learning techniques. Features are extracted from the data using Term Frequency-Inverse Document Frequency (TF-IDF) and bag-of-words. The SMS dataset was imbalanced, so over-sampling and under-sampling techniques were applied to address this. A support vector classifier, gradient boosting machine, random forest, Gaussian Naive Bayes, and logistic regression were applied to the spam and ham SMS dataset, and their performance was evaluated using accuracy, precision, recall, and F1 score. The experimental results show that random forest classifies spam and ham SMS most accurately, with 99% accuracy. The proposed model identifies the SMS category (ham or spam) well when trained with TF-IDF features and the over-sampling technique. The approach was also evaluated on a spam email dataset, where it likewise reached 99% accuracy.
Pages: 39853-39871
Page count: 18
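The feature extraction and class balancing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy messages, the whitespace tokenizer, the unsmoothed IDF formula, and the choice of random duplication as the over-sampling method are all assumptions made for illustration, and the function names are hypothetical. The record does not say which over-sampling variant the paper uses.

```python
import math
import random
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's preprocessing is not given here.
    return text.lower().split()

def bag_of_words(docs):
    # One term-count mapping per document.
    return [Counter(tokenize(d)) for d in docs]

def tf_idf(docs):
    # Plain TF-IDF: tf = count / document length, idf = log(N / document frequency).
    # A term that occurs in every document gets weight 0 under this unsmoothed form.
    counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({term: (n / total) * idf[term] for term, n in c.items()})
    return vectors

def random_oversample(samples, labels, seed=0):
    # Duplicate randomly chosen minority-class samples until every class
    # has as many samples as the largest class.
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical toy data: one spam message and three ham messages.
messages = [
    "win a free prize now",
    "are we meeting for lunch today",
    "see you at home tonight",
    "call me when you are free",
]
labels = ["spam", "ham", "ham", "ham"]

vectors = tf_idf(messages)
balanced_vectors, balanced_labels = random_oversample(vectors, labels)
print(Counter(balanced_labels))
```

The balanced vectors could then be passed to any of the classifiers listed in the abstract. Over-sampling should be applied to the training split only, so that duplicated samples do not leak into the evaluation data.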