An optimal feature selection method for text classification through redundancy and synergy analysis

Cited by: 0
Authors
Farek L. [1 ,3 ]
Benaidja A. [2 ,3 ]
Affiliations
[1] Computer Science Department, University of Guelma, Guelma
[2] Computer Science Department, University of Setif 1, Setif
[3] Laboratory of Vision and Artificial Intelligence (LAVIA), Larbi Tebessi University, Tebessa
Keywords
Feature correlation; Feature redundancy; Feature selection; Synergistic information; Text classification
DOI
10.1007/s11042-024-19736-1
Abstract
Feature selection is an essential step in text classification tasks to enhance model performance, reduce computational complexity, and mitigate the risk of overfitting. Filter-based methods have gained popularity for their effectiveness and efficiency in selecting informative features. However, these methods often overlook feature correlations, resulting in the selection of redundant and irrelevant features while undervaluing others. To address this limitation, this paper proposes FS-RSA (Feature Selection through Redundancy and Synergy Analysis), a novel method for text classification. FS-RSA aims to identify an optimal feature subset by considering feature interactions at a lower computational cost. It achieves this by evaluating features to maximize their synergistic information and minimize redundancy within small subsets. The core principle of FS-RSA is that features providing similar classification information about the class variable are likely to be correlated and redundant, whereas pairs of features with high and low classification information can provide synergistic information. In experiments on five public datasets, FS-RSA was compared against five effective filter-based methods for text classification. It consistently achieved higher F1 scores with Naïve Bayes (NB) and support vector machine (SVM) classifiers, highlighting its effectiveness in feature selection while significantly reducing dimensionality. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
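As an illustrative sketch only, and not the authors' published FS-RSA algorithm (whose exact scoring is defined in the paper), the following Python code shows the general filter idea the abstract describes: rank features by mutual information with the class, treat features with similar scores as candidate redundancies, and keep only weakly correlated representatives. The function name select_by_redundancy_groups and the parameters n_groups and corr_threshold are assumptions made for illustration; the synergy analysis itself is not reproduced here.

# Hypothetical sketch of a redundancy-aware mutual-information filter,
# inspired by (but not identical to) the FS-RSA idea described in the abstract.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_by_redundancy_groups(X, y, n_select=100, n_groups=20, corr_threshold=0.7):
    """Return indices of features chosen by grouping features with similar
    mutual information (MI) to the class and pruning strongly correlated ones."""
    mi = mutual_info_classif(X, y, discrete_features=False, random_state=0)
    order = np.argsort(mi)[::-1]              # features by decreasing relevance
    groups = np.array_split(order, n_groups)  # small subsets of similar MI
    selected = []
    for group in groups:
        for j in group:
            # Keep a feature only if it is weakly correlated with every feature
            # already selected from the same group (redundancy check).
            redundant = any(
                abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > corr_threshold
                for k in selected if k in group
            )
            if not redundant:
                selected.append(j)
            if len(selected) >= n_select:
                return np.array(selected)
    return np.array(selected)

# Example usage on a synthetic bag-of-words-like matrix:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(200, 500)).astype(float)
    y = rng.integers(0, 2, size=200)
    idx = select_by_redundancy_groups(X, y, n_select=50)
    print("selected", len(idx), "features")

The grouping step reflects the abstract's premise that features with similar classification information tend to be redundant, so only one representative per highly correlated cluster needs to be retained.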
Pages: 16397-16423
Number of pages: 26