A study of spam filtering using support vector machines

Cited by: 69
Authors
Amayri, Ola [1]
Bouguila, Nizar [1]
Affiliations
[1] Concordia Univ, Concordia Inst Informat Syst Engn, Montreal, PQ, Canada
Keywords
Spam filtering; Support vector machines; String kernels; Feature mapping; Online active; CLASSIFICATION;
DOI
10.1007/s10462-010-9166-x
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Electronic mail is a major revolution over traditional communication systems owing to its convenient, economical, fast, and easy-to-use nature. A major bottleneck in electronic communications is the enormous dissemination of unwanted, harmful emails known as spam. A major concern is the development of suitable filters that can adequately capture those emails and achieve a high performance rate. Machine learning (ML) researchers have developed many approaches to tackle this problem. Within the context of machine learning, support vector machines (SVM) have made a large contribution to the development of spam email filtering. Based on SVM, different schemes have been proposed through text classification (TC) approaches. A crucial problem when using SVM is the choice of kernels, as they directly affect the separation of emails in the feature space. This paper presents a thorough investigation of several distance-based kernels and specifies spam filtering behaviors using SVM. Most kernels used in recent studies concern continuous data and neglect the structure of the text. In contrast to classical kernels, we propose the use of various string kernels for spam filtering. We show how effectively string kernels suit the spam filtering problem. On the other hand, data preprocessing is a vital part of text classification, where the objective is to generate feature vectors usable by SVM kernels. We detail feature mapping variants in TC that yield improved performance for the standard SVM in the filtering task. Furthermore, to cope with real-time scenarios, we propose an online active framework for spam filtering. We present empirical results from an extensive study of online, transductive, and online active methods for classifying spam emails in real time. We show that the active online method using string kernels achieves higher precision and recall rates.
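The abstract does not say which string kernels the authors evaluate, so the following is only an illustrative sketch of the general idea: a k-spectrum kernel, one common string kernel, which compares two raw texts by their shared contiguous substrings of length k instead of by continuous feature vectors. The function names, the choice k = 3, and the cosine-style normalization are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def spectrum_features(text: str, k: int = 3) -> Counter:
    """Count every contiguous substring of length k in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> float:
    """Inner product of the k-mer count vectors of a and b."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    # Iterate over the smaller feature map; missing k-mers count as 0.
    if len(fa) > len(fb):
        fa, fb = fb, fa
    return float(sum(count * fb[kmer] for kmer, count in fa.items()))


def normalized_kernel(a: str, b: str, k: int = 3) -> float:
    """Kernel value scaled to [0, 1] so message length matters less."""
    kaa = spectrum_kernel(a, a, k)
    kbb = spectrum_kernel(b, b, k)
    if kaa == 0 or kbb == 0:
        # At least one text is shorter than k and has no k-mers.
        return 0.0
    return spectrum_kernel(a, b, k) / sqrt(kaa * kbb)


def gram_matrix(texts: list[str], k: int = 3) -> list[list[float]]:
    """Pairwise normalized kernel values for a list of texts."""
    return [[normalized_kernel(s, t, k) for t in texts] for s in texts]


if __name__ == "__main__":
    # Hypothetical toy messages, not data from the paper.
    messages = ["win cash now", "win cash prizes", "meeting at noon"]
    for row in gram_matrix(messages):
        print([round(value, 3) for value in row])
```

A Gram matrix built this way could in principle be passed to any SVM implementation that accepts a precomputed kernel; that step is omitted here to keep the sketch free of third-party dependencies.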
Pages: 73-108
Page count: 36
Related papers
50 in total
  • [31] A Research on Using Support Vector Machine to Classify Chinese Spam
    Chi, He-Tsun
    Hsu, Yung-Ming
    Wan, Shien-Wen
    Wu, Yong-Yu
    Lin, Rui-Ting
    Chen, Jeanne
    Chen, Tung-Shou
    PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON INFORMATION AND MANAGEMENT SCIENCES, 2009, 8 : 479 - 483
  • [32] On the Study of Anomaly-based Spam Filtering Using Spam as Representation of Normality
    Laorden, Carlos
    Ugarte-Pedrero, Xabier
    Santos, Igor
    Sanz, Borja
    Nieves, Javier
    Bringas, Pablo G.
    2012 IEEE CONSUMER COMMUNICATIONS AND NETWORKING CONFERENCE (CCNC), 2012, : 693 - 695
  • [33] Improving performance of text categorization by combining filtering and support vector machines
    Díaz, I
    Ranilla, J
    Montañes, E
    Fernández, J
    Combarro, EF
    JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2004, 55 (07): : 579 - 592
  • [34] The study of credit evaluation of business websites using support vector machines
    Hu Guo-Sheng
    Zhang Guo-hong
    PROCEEDINGS OF 2007 INTERNATIONAL CONFERENCE ON MANAGEMENT SCIENCE & ENGINEERING (14TH) VOLS 1-3, 2007, : 263 - 267
  • [35] A Study on GPS GDOP Approximation Using Support-Vector Machines
    Wu, Chih-Hung
    Su, Wei-Han
    Ho, Ya-Wei
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2011, 60 (01) : 137 - 145
  • [36] Electrical Load Forecasting Using Support Vector Machines: a Case Study
    Turkay, Belgin Emre
    Demren, Dilara
    INTERNATIONAL REVIEW OF ELECTRICAL ENGINEERING-IREE, 2011, 6 (05): : 2411 - 2418
  • [37] Performance analysis of Naive Bayes classification, support vector machines and neural networks for spam categorization
    Tantuğ, A. Cüneyd
    Eryiğit, Gülşen
    APPLIED SOFT COMPUTING TECHNOLOGIES: THE CHALLENGE OF COMPLEXITY, 2006, 34 : 495 - 504
  • [38] Effective training of support vector machines using extractive support vector algorithm
    Yao, Chih-Chia
    Yu, Pao-Ta
    PROCEEDINGS OF 2007 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2007, : 1808 - +
  • [39] Speech Recognition using Support Vector Machines
    Aida-zade, Kamil
    Xocayev, Anar
    Rustamov, Samir
    2016 IEEE 10TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT), 2016, : 108 - 111
  • [40] Using Support Vector Machines for numerical prediction
    Hussain, Shahid
    Khamisani, Vaqar
    INMIC 2007: PROCEEDINGS OF THE 11TH IEEE INTERNATIONAL MULTITOPIC CONFERENCE, 2007, : 88 - 92