The impact of preprocessing on text classification

Cited by: 393
Authors
Uysal, Alper Kursat [1]
Gunal, Serkan [1]
Affiliations
[1] Anadolu Univ, Dept Comp Engn, Eskisehir, Turkey
Keywords
Pattern recognition; Text categorization; Text classification; Text preprocessing; FEATURE-SELECTION; ALGORITHM; MODEL
DOI
10.1016/j.ipm.2013.08.006
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Preprocessing is one of the key components in a typical text classification framework. This paper aims to extensively examine the impact of preprocessing on text classification in terms of various aspects such as classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two different domains, namely e-mail and news, and in two different languages, namely Turkish and English. In this way, contribution of the preprocessing tasks to classification success at various feature dimensions, possible interactions among these tasks, and also dependency of these tasks to the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement on classification accuracy depending on the domain and language studied on. © 2013 Elsevier Ltd. All rights reserved.
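The abstract's idea of evaluating "all possible combinations" of preprocessing tasks can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the abstract does not list the tasks, so the three task names, the toy stop-word list, and the `naive_stem` helper are hypothetical stand-ins, and vocabulary size is used here only as a rough proxy for the dimension-reduction aspect rather than as the paper's evaluation measure.

```python
from itertools import product

# Hypothetical task names; the abstract does not enumerate the tasks studied.
TASKS = ("lowercase", "remove_stopwords", "stem")

# Toy stop-word list, assumed for illustration only.
STOPWORDS = {"the", "is", "a", "of"}


def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, lowercase, remove_stopwords, stem):
    """Split on whitespace, then apply each enabled task in a fixed order."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


def all_configurations():
    """Return every on/off combination of the tasks (2 ** len(TASKS) dicts)."""
    return [
        dict(zip(TASKS, flags))
        for flags in product((False, True), repeat=len(TASKS))
    ]


def vocabulary_size(docs, config):
    """Count distinct tokens across docs under one configuration."""
    vocab = set()
    for doc in docs:
        vocab.update(preprocess(doc, **config))
    return len(vocab)


if __name__ == "__main__":
    docs = ["The cat", "the cats"]
    for config in all_configurations():
        print(config, vocabulary_size(docs, config))
```

In a fuller experiment, each configuration would feed a feature-selection and classification step, and accuracy would be compared per domain and language; that part is omitted here because the abstract gives no details of it.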
Pages: 104-112
Page count: 9