Spam SMS filtering based on text features and supervised machine learning techniques

Cited by: 0
Authors
Muhammad Adeel Abid
Saleem Ullah
Muhammad Abubakar Siddique
Muhammad Faheem Mushtaq
Wajdi Aljedaani
Furqan Rustam
Affiliations
[1] Khwaja Fareed University of Engineering and Information Technology, Department of Software Engineering, School of Systems and Technology
[2] The Islamia University of Bahawalpur
[3] University of North Texas
[4] University of Management and Technology
Source
Multimedia Tools and Applications | 2022 / Vol. 81
Keywords
SMS; Spam; Supervised machine learning; TF-IDF; Bag of words; Classification
DOI
Not available
Abstract
Advances in technology have left a significant mark on every field of life, including medicine, music, office work, travel, and communication. In the past, telephone lines served as the main communication medium. Today, wireless technology has displaced wired telephony and offers much broader features. Advertising agencies and spammers commonly use SMS to push their promotional material to ordinary users. As a result, more than 60% of the SMS received daily are spam. These messages annoy users and sometimes defraud them, while generating large profits for spammers and advertising companies. This study proposes an approach for classifying SMS as spam or ham using supervised machine learning techniques. Two feature extraction techniques, Term Frequency-Inverse Document Frequency (TF-IDF) and bag-of-words, are used to extract features from the data. The SMS dataset was imbalanced, so over-sampling and under-sampling techniques were applied to address this. A support vector classifier, gradient boosting machine, random forest, Gaussian Naive Bayes, and logistic regression are applied to the spam and ham SMS dataset, and their performance is evaluated using accuracy, precision, recall, and F1 score. The experimental results show that random forest classifies spam and ham SMS most accurately, with 99% accuracy. The proposed model performs best at identifying the SMS category (ham or spam) when trained with TF-IDF features and the oversampling technique. The proposed approach was also evaluated on a spam email dataset, where it likewise achieved 99% accuracy.
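The feature extraction and balancing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the unsmoothed `idf = ln(N / df)` weighting, the helper names (`bag_of_words`, `tf_idf`, `random_oversample`), and the toy messages are all assumptions chosen for illustration, and the classifier stage is omitted.

```python
import math
import random
from collections import Counter

def bag_of_words(docs):
    """Return one token-count Counter per document (lowercase, whitespace split)."""
    return [Counter(doc.lower().split()) for doc in docs]

def tf_idf(docs):
    """Weight each term by tf * idf, using the unsmoothed idf = ln(N / df)."""
    counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {t: (n / total) * math.log(n_docs / df[t]) for t, n in c.items()}
        )
    return vectors

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for y, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([y] * target)
    return out_samples, out_labels

# Toy messages invented for illustration only (not taken from the paper's dataset).
messages = [
    "win a free prize now",
    "are we meeting today",
    "see you at lunch",
    "call me when free",
]
labels = ["spam", "ham", "ham", "ham"]

features = tf_idf(messages)
balanced_x, balanced_y = random_oversample(features, labels)
print(Counter(balanced_y))
```

In practice, oversampling is usually applied to the training split only, so that duplicated messages do not leak into the evaluation data.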
Pages: 39853-39871
Page count: 18