Enhancing the Naive Bayes Spam Filter through Intelligent Text Modification Detection

Cited by: 13
Authors
Huang, Linda [1]
Jia, Julia [1]
Ingram, Emma [1]
Peng, Wuxu [2]
Affiliations
[1] 2017 Honors Summer Math Camp, San Marcos, TX 78666 USA
[2] Texas State Univ, Dept Comp Sci, San Marcos, TX USA
Source
2018 17TH IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS (IEEE TRUSTCOM) / 12TH IEEE INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING (IEEE BIGDATASE) | 2018
Keywords
Email; Spam; Spam Filter; Bayes Spam Filter; Naive Bayes Classifier; SpamAssassin; Text Classification; Bayesian Poisoning
DOI
10.1109/TrustCom/BigDataSE.2018.00122
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Spam emails have been a chronic issue in computer security. They are economically costly and dangerous for computers and networks. Despite the emergence of social networks and other Internet-based information exchange venues, dependence on email communication has increased over the years, and this dependence has resulted in an urgent need to improve spam filters. Although many spam filters have been created to help prevent spam emails from entering a user's inbox, there is a lack of research focusing on text modifications. Currently, Naive Bayes is one of the most popular methods of spam classification because of its simplicity and efficiency. Naive Bayes is also very accurate; however, it is unable to correctly classify emails when they contain leetspeak or diacritics. Thus, in this work, we implemented a novel algorithm for enhancing the accuracy of the Naive Bayes Spam Filter so that it can detect text modifications and correctly classify the email as spam or ham. Our Python algorithm combines semantic-based, keyword-based, and machine learning algorithms to increase the accuracy of Naive Bayes compared to SpamAssassin by over two hundred percent. Additionally, we have discovered a relationship between the length of the email and the spam score, indicating that Bayesian Poisoning, a controversial topic, is a real phenomenon and is utilized by spammers.
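To illustrate the kind of text modification the abstract refers to, the following is a minimal sketch, not the authors' algorithm, of how leetspeak and diacritics might be normalized before tokens reach a Naive Bayes classifier. The substitution table `LEET_MAP` and the function names are hypothetical; the record does not give the paper's actual mapping or pipeline.

```python
import unicodedata

# Hypothetical leetspeak table for illustration only; the paper's
# actual substitution rules are not given in this record.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def strip_diacritics(text):
    """Decompose characters and drop combining marks (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_token(token):
    """Lowercase a token, strip diacritics, and undo leetspeak.

    Tokens without any letter (plain numbers, prices) are left alone
    so that '2018' is not rewritten.
    """
    cleaned = strip_diacritics(token).lower()
    if not any(ch.isalpha() for ch in cleaned):
        return cleaned
    return cleaned.translate(LEET_MAP)


def normalize_text(text):
    """Normalize every whitespace-separated token in a message."""
    return " ".join(normalize_token(tok) for tok in text.split())
```

For example, `normalize_text("FR33 m0néy")` yields `"free money"`, so a classifier trained on unmodified words would see familiar tokens. A table this simple will also rewrite legitimate mixed tokens (such as product codes), which is one reason a real filter would need more careful detection than this sketch provides.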
Pages: 849-854
Number of pages: 6