Training SVM email classifiers using very large imbalanced dataset

Cited by: 8
Authors
Diao, Lili [1]
Yang, Chengzhong [2]
Wang, Hao [3]
Affiliations
[1] Trend Micro Inc, Core Technol Res, Nanjing 210021, Jiangsu, Peoples R China
[2] Nanjing Univ, Sch Management & Engn, Nanjing 210093, Jiangsu, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
Keywords
email classification; support vector machine; imbalance learning; training set compression; undersampling;
DOI
10.1080/0952813X.2011.610033
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Internet has been flooded with spam emails, and during the last decade there has been an increasing demand for reliable anti-spam email filters. The problem of filtering emails can be considered as a classification problem in the field of supervised learning. Theoretically, many mature technologies, for example, support vector machines (SVM), can be used to solve this problem. However, in real enterprise applications, the training data are typically collected via honeypots and thus are always of huge amounts and highly biased towards spam emails. This challenges both efficiency and effectiveness of conventional technologies. In this article, we propose an undersampling method to compress and balance the training set used for the conventional SVM classifier with minimal information loss. The key observation is that we can make a trade-off between training set size and information loss by carefully defining a similarity measure between data samples. Our experiments show that the SVM classifier provides a better performance by applying our compressing and balancing approach.
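The abstract does not spell out the similarity measure or the compression procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes cosine similarity over numeric feature vectors and a greedy rule that drops a majority-class sample when it is too similar to one already kept. The names `cosine`, `compress`, `balance` and the `threshold` parameter are hypothetical, chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress(samples, threshold):
    """Greedy compression (an assumed rule, not taken from the paper):
    keep a sample only if it is less similar than `threshold`
    to every sample kept so far."""
    kept = []
    for sample in samples:
        if all(cosine(sample, k) < threshold for k in kept):
            kept.append(sample)
    return kept


def balance(spam, ham, threshold=0.95):
    """Undersample the majority class (assumed here to be spam)
    and leave the minority class untouched."""
    return compress(spam, threshold), ham


# Toy data: five spam vectors containing two near-duplicate pairs.
spam = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [1.0, 1.0]]
ham = [[0.5, 0.2], [0.1, 0.9]]

spam_small, ham_kept = balance(spam, ham, threshold=0.95)
print(len(spam), "->", len(spam_small))  # near-duplicates are dropped
```

In this sketch the threshold controls the trade-off the abstract mentions: a lower value removes more samples and shrinks the training set further, while a higher value keeps more samples and loses less information. The compressed set could then be passed to any conventional SVM trainer.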
Pages: 193-210
Page count: 18
Related papers
50 in total
  • [1] Fast SVM training using edge detection on very large datasets
    Li, Boyang
    Wang, Qiangwei
    Hu, Jinglu
    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2013, 8 (03) : 229 - 237
  • [2] Fast SVM training using data reconstruction for classification of very large datasets
    Liang, Peileng
    Li, Weite
    Hu, Jinglu
    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2020, 15 (03) : 372 - 381
  • [3] An Empirical Method to Improve the Performance of the Classifiers on Imbalanced Dataset
    Babu, S.
    Narayanan, N. R. Anantha
    2016 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH, 2016, : 940 - 947
  • [4] Compression method based on training dataset of SVM
    Ban Xiaojuan
    Shen Qilong
    Chen Hao
    Tu Xuyan
    JOURNAL OF SYSTEMS ENGINEERING AND ELECTRONICS, 2008, 19 (01) : 198 - U4
  • [5] Compression method based on training dataset of SVM
    Ban Xiaojuan
    Journal of Systems Engineering and Electronics, 2008, (01) : 198 - 201
  • [6] Online Evaluation of Email Streaming Classifiers Using GNUsmail
    Carmona-Cejudo, Jose M.
    Baena-Garcia, Manuel
    del Campo-Avila, Jose
    Bifet, Albert
    Gama, Joao
    Morales-Bueno, Rafael
    ADVANCES IN INTELLIGENT DATA ANALYSIS X: IDA 2011, 2011, 7014 : 90 - +
  • [7] Predicting the future transaction from large and imbalanced banking dataset
    Ilyas S.
    Zia S.
    Butt U.M.
    Letchmunan S.
    un Nisa Z.
    International Journal of Advanced Computer Science and Applications, 2020, 11 (01) : 273 - 286
  • [8] Predicting the Future Transaction from Large and Imbalanced Banking Dataset
    Ilyas, Sadaf
    Zia, Sultan
    Butt, Umair Muneer
    Letchmunan, Sukumar
    Nisa, Zaib Un
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (01) : 273 - 286
  • [9] An imbalanced training data SVM classification problem based on Riemannian metric
    Zhou Qifeng
    Lin Chengde
    Luo Linkai
    Peng Hong
    PROCEEDINGS OF THE 26TH CHINESE CONTROL CONFERENCE, VOL 4, 2007, : 554 - +
  • [10] Wilson's disease classification using higher-order Gabor tensors and various classifiers on a small and imbalanced brain MRI dataset
    Tiwari, Anurag
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (23) : 35121 - 35147