On strategies for imbalanced text classification using SVM: A comparative study

Cited by: 191
Authors
Sun, Aixin [1]
Lim, Ee-Peng [2]
Liu, Ying [3]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Engn, Singapore, Singapore
[2] Singapore Management Univ, Sch Informat Syst, Singapore, Singapore
[3] Hong Kong Polytech Univ, Dept Ind & Syst Engn, Hong Kong, Hong Kong, Peoples R China
Keywords
Imbalanced text classification; Support Vector Machines; SVM; Resampling; Instance weighting
DOI
10.1016/j.dss.2009.07.011
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many real-world text classification tasks involve imbalanced training examples. The strategies proposed to address imbalanced classification (e.g., resampling, instance weighting), however, have not been systematically evaluated in the text domain. In this paper, we conduct a comparative study of the effectiveness of these strategies in the context of imbalanced text classification using a Support Vector Machine (SVM) classifier. SVM is of interest in this study because of the good classification accuracy reported for it in many text classification tasks. We propose a taxonomy that organizes the proposed strategies according to the training and test phases of text classification tasks. Based on the taxonomy, we survey the methods proposed to address imbalanced classification. Among them, 10 commonly used methods were evaluated in our experiments on three benchmark datasets: Reuters-21578, 20-Newsgroups, and WebKB. Using the area under the Precision-Recall Curve as the performance measure, our experimental results showed that the best decision surface was often learned by the standard SVM, not coupled with any of the proposed strategies. We believe such a negative finding will benefit both researchers and application developers in the area by directing more attention to thresholding strategies. © 2009 Elsevier B.V. All rights reserved.
Pages: 191-201
Number of pages: 11
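The abstract names the area under the Precision-Recall Curve as the performance measure but does not say how it was computed. As a rough illustration only, the sketch below estimates that area with step-wise average precision over a ranked list of classifier scores. The function name `average_precision`, the 0/1 label encoding, and the tie handling (ties are broken by sort order) are assumptions made for this example, not details taken from the paper.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the Precision-Recall curve.

    labels: sequence of 0/1 values, where 1 marks the (minority) positive class
    scores: classifier decision values; higher means "more positive"
    Illustrative helper, not the procedure used in the paper.
    """
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(label for _, label in ranked)
    if total_pos == 0:
        raise ValueError("need at least one positive example")

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where recall increases by 1 / total_pos.
            precision_sum += hits / rank
    return precision_sum / total_pos


# Toy imbalanced example: 2 positives among 5 documents.
example_labels = [1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(average_precision(example_labels, example_scores))
```

Because the measure depends only on the ranking of the scores, it evaluates the learned decision surface independently of where the decision threshold is placed, which is consistent with the abstract treating thresholding as a separate concern.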