An optimal feature selection method for text classification through redundancy and synergy analysis

Cited by: 0
Authors
Farek L. [1 ,3 ]
Benaidja A. [2 ,3 ]
Affiliations
[1] Computer Science Department, University of Guelma, Guelma
[2] Computer Science Department, University of Setif 1, Setif
[3] Laboratory of Vision and Artificial Intelligence (LAVIA), Larbi Tebessi University, Tebessa
Keywords
Feature correlation; Feature redundancy; Feature selection; Synergistic information; Text classification;
DOI
10.1007/s11042-024-19736-1
Abstract
Feature selection is an essential step in text classification tasks to enhance model performance, reduce computational complexity, and mitigate the risk of overfitting. Filter-based methods have gained popularity for their effectiveness and efficiency in selecting informative features. However, these methods often overlook feature correlations, resulting in the selection of redundant and irrelevant features while underestimating others. To address this limitation, this paper proposes FS-RSA (Feature Selection through Redundancy and Synergy Analysis), a novel method for text classification. FS-RSA aims to identify an optimal feature subset by considering feature interactions at a lower computational cost. It achieves this by evaluating features to maximize their synergistic information and minimize redundancy within small subsets. The core principle of FS-RSA is that features offering similar classification information about the class variable are likely to be correlated and redundant, whereas pairing features with high and low classification information can provide synergistic information. In experiments conducted on five public datasets, FS-RSA was compared to five effective filter-based methods for text classification. It consistently achieved higher F1 scores with Naïve Bayes (NB) and SVM classifiers, highlighting its effectiveness in feature selection while significantly reducing dimensionality. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
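The redundancy/synergy distinction in the abstract can be illustrated with interaction information (McGill's multivariate information transmission, cited as reference [27]). The following is a minimal, self-contained sketch — not the authors' FS-RSA implementation — assuming discrete (e.g., binarized term-occurrence) features: for a feature pair (X, Y) and class C, I(X; Y; C) = I(X, Y; C) − I(X; C) − I(Y; C) is negative when the pair is redundant and positive when it is synergistic.

```python
# Illustrative sketch only (not the authors' FS-RSA code): interaction
# information for discrete features, assuming binarized text features.
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_info(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for paired discrete samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def interaction_info(xs, ys, cs):
    """I(X; Y; C): negative -> redundant pair, positive -> synergistic pair."""
    joint = list(zip(xs, ys))  # treat the pair (X, Y) as one joint feature
    return mutual_info(joint, cs) - mutual_info(xs, cs) - mutual_info(ys, cs)

# XOR-like class: neither feature alone predicts C, but together they do.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
c = [a ^ b for a, b in zip(x, y)]
print(interaction_info(x, y, c))  # 1.0 bit: fully synergistic

# A feature paired with a duplicate of itself adds nothing new.
print(interaction_info(x, x, x))  # -1.0 bit: fully redundant
```

A filter method built on this idea would score candidate pairs (or small subsets, as FS-RSA does) and keep features whose joint contribution exceeds the sum of their individual contributions.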
Pages: 16397-16423
Page count: 26
References (48 total)
  • [21] Kolluri J., Razia S., WITHDRAWN: Text classification using Naïve Bayes classifier, Materials Today: Proceedings, (2020)
  • [22] Kumar V., Feature Selection: A literature Review, SmartCR, 4, (2014)
  • [23] Lei S., A Feature Selection Method Based on Information Gain and Genetic Algorithm, In: 2012 International Conference on Computer Science and Electronics Engineering. IEEE, pp. 355-358, (2012)
  • [24] Liu X., Wang S., Lu S., Et al., Adapting Feature Selection Algorithms for the Classification of Chinese Texts, Systems, 11, (2023)
  • [25] Mamdouh Farghaly H., Abd El-Hafeez T., A high-quality feature selection method based on frequent and correlated items for text classification, Soft Comput, 27, pp. 11259-11274, (2023)
  • [26] Mao K.Z., Orthogonal forward selection and backward elimination algorithms for feature subset selection, IEEE Trans Syst Man Cybernet Part B (Cybernetics), 34, pp. 629-634, (2004)
  • [27] McGill W.J., Multivariate information transmission, Psychometrika, 19, pp. 97-116, (1954)
  • [28] Miri M., Dowlatshahi M.B., Hashemi A., Et al., Ensemble feature selection for multi-label text classification: An intelligent order statistics approach, Int J of Intelligent Sys, 37, pp. 11319-11341, (2022)
  • [29] Ogura H., Amano H., Kondo M., Comparison of metrics for feature selection in imbalanced text classification, Expert Syst Appl, 38, pp. 4978-4989, (2011)
  • [30] Pang B., Lee L., A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04). Association for Computational Linguistics, pp. 271-278, (2004)