The Improved Logistic Regression Models for Spam Filtering

Cited by: 7
Authors
Han, Yong [1]
Yang, Muyun [2]
Qi, Haoliang [1]
He, Xiaoning [2]
Li, Sheng [2]
Affiliations
[1] Heilongjiang Inst Technol, Comp Sci & Technol Dept, Harbin, Peoples R China
[2] Harbin Inst Technol, Comp Sci & Engn Sch, Harbin, Peoples R China
Source
2009 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING | 2009
Funding
National Natural Science Foundation of China
Keywords
spam filtering; improved logistic regression; online learning; byte level n-gram
DOI
10.1109/IALP.2009.74
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The logistic regression model has been successful in spam filtering, but it has a drawback: during training, it adjusts the weights of features that appear in both spam and ham messages equally. This paper presents an improved logistic regression model that reduces the impact of features appearing in both spam and ham messages. Byte-level n-grams are used to extract features from messages, and TONE (Train On or Near Error) is adopted; both have proved effective in state-of-the-art spam filtering systems. The official runs of the CEAS (Conference on Email and Anti-Spam) Spam-filter Challenge 2008 show that the proposed model is one of the best methods. Our system achieved competitive results in all tasks and won the active learning task on the live stream as measured by 1-ROCA.
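
The baseline ingredients named in the abstract (byte-level n-gram features, online logistic regression, and TONE) can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the n-gram length, learning rate, and TONE margin are assumed values, the class and function names are hypothetical, and the paper's improved weight adjustment for features shared by spam and ham is not described in the abstract and so is not reproduced here.

```python
import math


def byte_ngrams(message: bytes, n: int = 4) -> set:
    """Return the distinct byte-level n-grams of a raw message (n=4 is an assumption)."""
    return {message[i:i + n] for i in range(len(message) - n + 1)}


class OnlineLogisticFilter:
    """Plain online logistic regression with TONE; hypothetical names and defaults."""

    def __init__(self, rate: float = 0.1, margin: float = 0.1, n: int = 4):
        self.weights = {}     # n-gram -> weight, unseen n-grams count as 0
        self.rate = rate      # learning rate (assumed value)
        self.margin = margin  # TONE band around the 0.5 decision threshold (assumed value)
        self.n = n

    def score(self, message: bytes) -> float:
        """Estimated probability that the message is spam."""
        z = sum(self.weights.get(g, 0.0) for g in byte_ngrams(message, self.n))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, message: bytes, is_spam: bool) -> bool:
        """TONE: update only if the prediction is wrong or close to the threshold."""
        p = self.score(message)
        wrong = (p > 0.5) != is_spam
        near = abs(p - 0.5) < self.margin
        if not (wrong or near):
            return False  # confident and correct: skip the update
        step = self.rate * ((1.0 if is_spam else 0.0) - p)
        # Baseline behaviour: every active feature receives the same adjustment.
        for g in byte_ngrams(message, self.n):
            self.weights[g] = self.weights.get(g, 0.0) + step
        return True
```

The final loop is the "equal adjustment" the abstract identifies as a weakness: every n-gram in the message moves by the same step, whether or not it also occurs in the other class.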
Pages: 314-317
Number of pages: 4