The impact of preprocessing on text classification

Cited by: 373
Authors
Uysal, Alper Kursat [1]
Gunal, Serkan [1]
Affiliations
[1] Anadolu Univ, Dept Comp Engn, Eskisehir, Turkey
Keywords
Pattern recognition; Text categorization; Text classification; Text preprocessing; Feature selection; Algorithm; Model
DOI
10.1016/j.ipm.2013.08.006
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Preprocessing is one of the key components in a typical text classification framework. This paper aims to extensively examine the impact of preprocessing on text classification in terms of various aspects such as classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two different domains, namely e-mail and news, and in two different languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, possible interactions among these tasks, and the dependency of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement in classification accuracy depending on the domain and language studied. (C) 2013 Elsevier Ltd. All rights reserved.
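To illustrate what evaluating "all possible combinations" of preprocessing tasks can look like, here is a minimal Python sketch. It is not the authors' code. The three task names, the toy stop-word list, and the crude suffix stripper are all assumptions made for illustration, since this record does not list the tasks or tools used in the paper.

```python
from itertools import product

# Hypothetical toy stop-word list (assumption; not taken from the paper).
STOP_WORDS = {"the", "is", "a", "of", "on", "and"}


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text, lowercase, remove_stop_words, stem):
    """Apply the enabled preprocessing tasks to whitespace-split tokens."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        # Compare case-insensitively so the switch works without lowercasing.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


def all_configurations(task_names):
    """Return every on/off assignment of the given tasks (2**n dicts)."""
    return [
        dict(zip(task_names, flags))
        for flags in product([False, True], repeat=len(task_names))
    ]


# Illustrative task switches; the names are assumptions.
TASKS = ("lowercase", "remove_stop_words", "stem")
configs = all_configurations(TASKS)

if __name__ == "__main__":
    for config in configs:
        print(config, preprocess("The Classifier is learning", **config))
```

In a full experiment, each configuration would feed a feature-selection and classification step, and accuracy would be compared across configurations for each dataset. That part is omitted here.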
Pages: 104-112
Page count: 9
Related papers
50 in total
  • [31] Unsupervised Feature Selection for Text Classification via Word Embedding
    Rui, Weikang
    Liu, Jinwen
    Jia, Yawei
    [J]. PROCEEDINGS OF 2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA ANALYSIS (ICBDA), 2016, : 37 - 41
  • [32] Simple yet Effective Classification Model for Skewed Text Categorization
    Suhil, Mahamad
    Guru, D. S.
    Raju, Lavanya Narayana
    Gowda, Harsha S.
    [J]. 2016 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2016, : 904 - 910
  • [33] Feature selection methods for text classification: a systematic literature review
    Pintas, Julliano Trindade
    Fernandes, Leandro A. F.
    Garcia, Ana Cristina Bicharra
    [J]. ARTIFICIAL INTELLIGENCE REVIEW, 2021, 54 (08) : 6149 - 6200
  • [34] Feature selection based on absolute deviation factor for text classification
    Jin, Lingbin
    Zhang, Li
    Zhao, Lei
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (03)
  • [35] ARTC: feature selection using association rules for text classification
    Saeed, Mozamel M.
    Al Aghbari, Zaher
    [J]. NEURAL COMPUTING & APPLICATIONS, 2022, 34 (24) : 22519 - 22529
  • [36] On Two-Stage Feature Selection Methods for Text Classification
    Uysal, Alper Kursat
    [J]. IEEE ACCESS, 2018, 6 : 43233 - 43251
  • [38] The impact of preprocessing on word embedding quality: a comparative study
    Rahimi, Zahra
    Homayounpour, Mohammad Mehdi
    [J]. LANGUAGE RESOURCES AND EVALUATION, 2023, 57 (01) : 257 - 291
  • [39] TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring
    Yousef, Malik
    Voskergian, Daniel
    [J]. FRONTIERS IN GENETICS, 2022, 13
  • [40] Category Discrimination Based Feature Selection Algorithm in Chinese Text Classification
    Yi, Junkai
    Yang, Guang
    Wan, Jing
    [J]. JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2016, 32 (05) : 1145 - 1159