The impact of preprocessing on text classification

Cited by: 373
Authors
Uysal, Alper Kursat [1]
Gunal, Serkan [1]
Affiliations
[1] Anadolu Univ, Dept Comp Engn, Eskisehir, Turkey
Keywords
Pattern recognition; Text categorization; Text classification; Text preprocessing; FEATURE-SELECTION; ALGORITHM; MODEL;
DOI
10.1016/j.ipm.2013.08.006
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Preprocessing is one of the key components in a typical text classification framework. This paper aims to extensively examine the impact of preprocessing on text classification in terms of various aspects such as classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two different domains, namely e-mail and news, and in two different languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, possible interactions among these tasks, and the dependency of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement in classification accuracy, depending on the domain and language studied. © 2013 Elsevier Ltd. All rights reserved.
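The idea of evaluating "all possible combinations" of preprocessing tasks can be illustrated with a minimal sketch. The abstract does not name the individual tasks, so the three on/off steps below (`lowercase`, `remove_stopwords`, `strip_suffix`), the toy stop-word list, and the crude suffix rule are assumptions chosen for illustration only and are not the authors' pipeline.

```python
from itertools import product

# Hypothetical on/off preprocessing steps; the abstract does not list the tasks.
STEPS = ("lowercase", "remove_stopwords", "strip_suffix")
STOPWORDS = {"the", "a", "is", "of"}  # toy list, for illustration only


def preprocess(text, lowercase, remove_stopwords, strip_suffix):
    """Apply the enabled steps, in a fixed order, to whitespace-split tokens."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if strip_suffix:
        # Crude stand-in for a stemmer: drop a trailing "s" from longer tokens.
        tokens = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return tokens


def all_combinations():
    """Return every on/off assignment of the steps (2 ** len(STEPS) settings)."""
    return [dict(zip(STEPS, flags))
            for flags in product((False, True), repeat=len(STEPS))]


if __name__ == "__main__":
    for combo in all_combinations():
        print(combo, preprocess("The impact of Preprocessing tasks", **combo))
```

In a study of this kind, each setting would feed a feature extraction and classification stage, so that accuracy can be compared across settings rather than only between "all steps on" and "all steps off".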
Pages: 104-112
Page count: 9