Comparison of feature selection methods in Kurdish text classification

Cited by: 0
Authors
Ari M. Saeed
Soran Badawi
Sara A. Ahmed
Diyari A. Hassan
Affiliations
[1] Computer Science Department, University of Halabja, KRG, Kurdistan, Halabja
[2] Language Center, Charmo University, KRG, Chamchamal, Kurdistan
[3] Department of Computer Science, Komar University of Science and Technology, Kurdistan Region, Sulaymaniyah
[4] Faculty of Engineering and Computer Science, Qaiwan International University, Kurdistan Region, Sulaymaniyah
Keywords
Feature selection; Kurdish language; Multinomial naive Bayes; Support vector machine; Text classification
DOI
10.1007/s42044-023-00159-4
Abstract
This study investigates the impact of feature selection (FS) on classifier performance for text classification (TC) in Kurdish. The high dimensionality of the feature space can degrade TC accuracy, so FS is used to reduce the feature space and improve accuracy. The study evaluates seven FS methods, namely discriminative feature selection (DFSS), chi-squared (CHI2), discriminative power measure (DPM), Gini index, distinguishing feature selector (DFS), comprehensively measure feature selection (CMFS), and correlation coefficient (CC), on two Kurdish datasets (KDC-4007 and KNDH). Multinomial naive Bayes (MNB) and support vector machine (SVM) classifiers are used to measure the accuracy and F-measure obtained with each FS method. The experiments test nine feature-subset sizes (50, 100, 250, 500, 750, 1000, 2000, 3000, and 4000). CHI2 and DPM yield the best F-measure and accuracy scores for SVM, while CHI2 and CMFS perform best for MNB. Most FS methods have been applied mainly to English texts, with little or no investigation of Kurdish. This study therefore fills a gap in the literature by evaluating the effectiveness of various FS methods for Kurdish TC. © The Author(s), under exclusive licence to Springer Nature Switzerland AG 2023.
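As a rough illustration of the filter-style FS step described in the abstract, the sketch below scores terms with the standard document-frequency form of the chi-squared statistic and keeps the top k. It is not the authors' implementation: the function names `chi2_scores` and `select_top_k`, the choice of the maximum over classes as the global score, and the English toy corpus are all assumptions made for illustration. No data from KDC-4007 or KNDH is used.

```python
from collections import Counter

def chi2_scores(docs, labels):
    """Score each term by its maximum chi-squared value over all classes.

    Uses the usual 2x2 contingency form on document counts:
    chi2 = N * (A*D - B*C)^2 / ((A+C) * (B+D) * (A+B) * (C+D)).
    """
    n = len(docs)
    doc_sets = [set(d.split()) for d in docs]   # whitespace tokenisation (assumption)
    class_sizes = Counter(labels)
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        best = 0.0
        for cls, n_cls in class_sizes.items():
            a = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == cls)  # term present, in class
            b = sum(1 for s, y in zip(doc_sets, labels) if term in s and y != cls)  # term present, other classes
            c = n_cls - a                                                           # term absent, in class
            d = (n - n_cls) - b                                                     # term absent, other classes
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores

def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical toy corpus standing in for a labelled document collection.
docs = ["goal match team", "team goal win", "vote party election", "election party win"]
labels = ["sport", "sport", "politics", "politics"]

scores = chi2_scores(docs, labels)
selected = select_top_k(scores, 4)
```

In an experiment of the kind the abstract describes, `k` would range over the nine subset sizes (50 to 4000), and the reduced vocabulary would be passed to a classifier such as MNB or SVM. Here the term "win" occurs equally often in both classes, so it receives a score of zero and is not selected.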
Pages: 55-64
Page count: 9