Training SVM email classifiers using very large imbalanced dataset

Cited by: 8

Authors:

Diao, Lili [1]

Yang, Chengzhong [2]

Wang, Hao [3]

Affiliations:

[1] Trend Micro Inc, Core Technol Res, Nanjing 210021, Jiangsu, Peoples R China

[2] Nanjing Univ, Sch Management & Engn, Nanjing 210093, Jiangsu, Peoples R China

[3] Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China

Source:

JOURNAL OF EXPERIMENTAL & THEORETICAL ARTIFICIAL INTELLIGENCE | 2012, Vol. 24, No. 02

Keywords:

email classification; support vector machine; imbalance learning; training set compression; undersampling;

DOI:

10.1080/0952813X.2011.610033

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The Internet has been flooded with spam emails, and during the last decade demand for reliable anti-spam email filters has grown. Email filtering can be treated as a classification problem in supervised learning. In theory, many mature techniques, for example support vector machines (SVM), can be used to solve it. However, in real enterprise applications, the training data are typically collected via honeypots and are therefore very large and heavily biased towards spam emails. This challenges both the efficiency and the effectiveness of conventional techniques. In this article, we propose an undersampling method that compresses and balances the training set used for the conventional SVM classifier with minimal information loss. The key observation is that we can trade off training set size against information loss by carefully defining a similarity measure between data samples. Our experiments show that the SVM classifier performs better when our compressing and balancing approach is applied.
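The trade-off described in the abstract, between training set size and information loss governed by a similarity measure, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes cosine similarity as the measure and a greedy pass that drops any majority-class sample too similar to one already kept; the function names, the threshold values, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compress(samples, threshold):
    """Greedy undersampling (assumed scheme, not the paper's method).

    A sample is kept only if its similarity to every sample kept so far
    is below `threshold`, so near-duplicates collapse to one representative.
    """
    kept = []
    for s in samples:
        if all(cosine(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


# Hypothetical toy data: the majority (spam) class contains near-duplicates.
spam = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [0.7, 0.7]]

# A high threshold removes only near-duplicates; a lower one compresses
# more aggressively at the cost of discarding more distinct samples.
spam_loose = compress(spam, threshold=0.95)
spam_tight = compress(spam, threshold=0.7)
```

On this toy data the looser threshold keeps three of the five spam vectors and the tighter one keeps two, which is the size-versus-information trade-off in miniature. The compressed majority class would then be combined with the minority class before training a classifier.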


Pages: 193-210

Page count: 18

50 items in total

[1] Fast SVM training using edge detection on very large datasets
Li, Boyang
Wang, Qiangwei
Hu, Jinglu
IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2013, 8 (03) : 229 - 237
[2] Fast SVM training using data reconstruction for classification of very large datasets
Liang, Peileng
Li, Weite
Hu, Jinglu
IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2020, 15 (03) : 372 - 381
[3] An Empirical Method to Improve the Performance of the Classifiers on Imbalanced Dataset
Babu, S.
Narayanan, N. R. Anantha
2016 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH, 2016, : 940 - 947
[4] Compression method based on training dataset of SVM
Ban Xiaojuan
Shen Qilong
Chen Hao
Tu Xuyan
JOURNAL OF SYSTEMS ENGINEERING AND ELECTRONICS, 2008, 19 (01) : 198 - U4
[5] Compression method based on training dataset of SVM
Ban Xiaojuan
Journal of Systems Engineering and Electronics, 2008, (01) : 198 - 201
[6] Online Evaluation of Email Streaming Classifiers Using GNUsmail
Carmona-Cejudo, Jose M.
Baena-Garcia, Manuel
del Campo-Avila, Jose
Bifet, Albert
Gama, Joao
Morales-Bueno, Rafael
ADVANCES IN INTELLIGENT DATA ANALYSIS X: IDA 2011, 2011, 7014 : 90 - +
[7] Predicting the future transaction from large and imbalanced banking dataset
Ilyas S.
Zia S.
Butt U.M.
Letchmunan S.
un Nisa Z.
International Journal of Advanced Computer Science and Applications, 2020, 11 (01) : 273 - 286
[8] Predicting the Future Transaction from Large and Imbalanced Banking Dataset
Ilyas, Sadaf
Zia, Sultan
Butt, Umair Muneer
Letchmunan, Sukumar
Nisa, Zaib Un
INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (01) : 273 - 286
[9] An imbalanced training data SVM classification problem based on Riemannian metric
Zhou Qifeng
Lin Chengde
Luo Linkai
Peng Hong
PROCEEDINGS OF THE 26TH CHINESE CONTROL CONFERENCE, VOL 4, 2007, : 554 - +
[10] Wilson's disease classification using higher-order Gabor tensors and various classifiers on a small and imbalanced brain MRI dataset
Tiwari, Anurag
MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (23) : 35121 - 35147
