The impact of preprocessing on text classification

Cited by: 373
Authors
Uysal, Alper Kursat [1]
Gunal, Serkan [1]
Affiliations
[1] Anadolu Univ, Dept Comp Engn, Eskisehir, Turkey
Keywords
Pattern recognition; Text categorization; Text classification; Text preprocessing; Feature selection; Algorithm; Model
DOI
10.1016/j.ipm.2013.08.006
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Preprocessing is one of the key components in a typical text classification framework. This paper extensively examines the impact of preprocessing on text classification in terms of classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two domains, namely e-mail and news, and in two languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, possible interactions among these tasks, and the dependency of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement in classification accuracy, depending on the domain and language studied. (C) 2013 Elsevier Ltd. All rights reserved.
Pages: 104-112
Number of pages: 9
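
The abstract describes evaluating every on/off combination of preprocessing tasks. As a minimal sketch of that enumeration only, the Python below treats each task as a boolean switch and generates all configurations. The task names (`lowercase`, `remove_stopwords`, `stem`), the toy stop-word list, and the `crude_stem` helper are assumptions made for illustration and are not taken from this record. Classifier training and accuracy measurement are left out.

```python
from itertools import product

# Hypothetical on/off preprocessing switches; the record does not name the tasks.
TASKS = ("lowercase", "remove_stopwords", "stem")

# Toy stop-word list, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of", "and"}


def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, lowercase, remove_stopwords, stem):
    """Apply only the enabled steps to whitespace-separated tokens."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        # Compare case-insensitively so this step works with lowercasing off.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def all_configurations():
    """Yield one dict per on/off combination: 2 ** len(TASKS) in total."""
    for flags in product((False, True), repeat=len(TASKS)):
        yield dict(zip(TASKS, flags))


if __name__ == "__main__":
    sample = "The classifier is Filtering the incoming Emails"
    for config in all_configurations():
        print(config, preprocess(sample, **config))
```

In a full experiment, each configuration would feed a feature extractor and classifier, and the resulting accuracies would be compared per dataset. With n switches the loop visits 2^n configurations, so exhaustive evaluation is practical only for a small number of tasks.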