Learning to classify e-mail

Cited by: 91
Authors
Koprinska, Irena [1]
Poon, Josiah [1]
Clark, James [1]
Chan, Jason [1]
Affiliation
[1] Univ Sydney, Sch Informat Technol, Sydney, NSW 2006, Australia
Keywords
e-mail classification into folders; spam e-mail filtering; random forest; co-training; machine learning
DOI
10.1016/j.ins.2006.12.005
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. First, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks, as it runs fast on large and high-dimensional databases, is easy to tune and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naive Bayes. We introduce a new accurate feature selector with linear time complexity. Second, we examine the applicability of the semi-supervised co-training paradigm for spam e-mail filtering by employing random forests, support vector machines, decision trees and naive Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples to an accuracy only 5% lower than that of a classifier trained on all labelled examples. We investigate the performance of co-training with one natural feature split and show that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits. © 2006 Elsevier Inc. All rights reserved.
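As a rough illustration of the co-training idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a two-view split of each e-mail into subject tokens and body tokens, uses a tiny hand-written naive Bayes as the base classifier, and lets each view in turn label its most confident unlabelled example. The names `TinyNB`, `co_train` and `predict`, the toy data, and the choice of views are all hypothetical.

```python
import math
from collections import Counter


class TinyNB:
    """Minimal multinomial naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def predict_proba(self, tokens):
        n = sum(self.prior.values())
        v = len(self.vocab) + 1  # +1 leaves room for unseen tokens
        logp = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            lp = math.log(self.prior[c] / n)
            for t in tokens:
                lp += math.log((self.counts[c][t] + 1) / (total + v))
            logp[c] = lp
        m = max(logp.values())
        z = sum(math.exp(lp - m) for lp in logp.values())
        return {c: math.exp(lp - m) / z for c, lp in logp.items()}


def co_train(labelled, unlabelled, rounds=5, per_round=1):
    """labelled: [((view0, view1), label)]; unlabelled: [(view0, view1)]."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            # Train on one view only, then score the unlabelled pool with it.
            clf = TinyNB().fit([x[view] for x, _ in labelled],
                               [y for _, y in labelled])
            scored = []
            for i, x in enumerate(pool):
                proba = clf.predict_proba(x[view])
                label = max(proba, key=proba.get)
                scored.append((proba[label], i, label))
            scored.sort(reverse=True)
            chosen = scored[:per_round]
            # Move the most confident examples into the labelled set.
            for _, i, label in chosen:
                labelled.append((pool[i], label))
            for i in sorted((i for _, i, _ in chosen), reverse=True):
                del pool[i]
    clfs = [TinyNB().fit([x[v] for x, _ in labelled], [y for _, y in labelled])
            for v in (0, 1)]
    return clfs, labelled


def predict(clfs, example):
    """Combine the two view classifiers by multiplying class probabilities."""
    p0 = clfs[0].predict_proba(example[0])
    p1 = clfs[1].predict_proba(example[1])
    return max(p0, key=lambda c: p0[c] * p1[c])


# Toy data (hypothetical): view 0 = subject tokens, view 1 = body tokens.
seed = [
    ((["cheap", "pills"], ["buy", "now", "discount"]), "spam"),
    ((["meeting", "agenda"], ["see", "you", "monday"]), "ham"),
]
unlabelled = [
    (["cheap", "offer"], ["discount", "now"]),
    (["agenda", "update"], ["monday", "room"]),
]
clfs, grown = co_train(seed, unlabelled)
```

Calling `predict(clfs, (subject_tokens, body_tokens))` then classifies a new message using both views. A real experiment would swap `TinyNB` for a stronger base learner such as a random forest and move several examples per round.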
Pages: 2167-2187
Page count: 21