A New Filter Feature Selection Method for Text Classification

Cited: 0
Authors
Cekik, Rasim [1 ]
Affiliations
[1] Sirnak Univ, Fac Engn, Dept Comp Engn, TR-73000 Sirnak, Turkiye
Keywords
Feature selection; text classification; dimensionality reduction; text mining
DOI
10.1109/ACCESS.2024.3468001
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Massive amounts of text data have been created on the Internet due to the widespread use of platforms such as social media. Text classification is one of the most frequently used techniques for extracting useful information from such data. One of its most fundamental problems is high dimensionality, which greatly reduces classifier accuracy while increasing computational cost. The most effective way to overcome this problem is to use a feature selector to choose a subset comprising the most distinctive features in the entire feature space. This study presents a new filter feature selection approach called Multivariate Feature Selector (MFS) for text classification. The proposed approach calculates a score for each feature based on three knowledge structures: class-based, document-based, and document-class-based. These structures are used to reveal hidden information at the class, document, and document-class levels, enabling a more precise and effective scoring of each term. MFS was tested on four different datasets, and the micro-F1 and macro-F1 measures were used to evaluate its feature selection performance. MFS was observed to outperform the main feature selection methods in the literature: although classification results varied with the size of the selected feature set, MFS showed superior performance in all selected sub-feature spaces.
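The record does not reproduce the MFS scoring formula, but the general filter-selection workflow the abstract describes, namely score every term, keep the top-k, then classify and report micro-F1 and macro-F1, can be sketched as follows. This is a minimal illustration using scikit-learn with the chi-square scorer as a stand-in for MFS; the toy corpus, vectorizer, classifier, and feature-set size are assumptions for illustration, not the paper's experimental setup.

# Minimal filter-feature-selection sketch: rank terms, keep the top-k,
# then evaluate a classifier with micro-F1 and macro-F1.
# NOTE: chi2 is a stand-in scorer; the paper's MFS score (built from class-,
# document-, and document-class-level statistics) is not reproduced here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy corpus (assumed data, for illustration only): 1 = spam, 0 = ham.
docs = [
    "cheap pills buy now", "limited offer buy cheap meds",
    "meeting agenda attached", "project status report attached",
    "win a free prize now", "quarterly report and agenda",
]
labels = [1, 1, 0, 0, 1, 0]

docs_tr, docs_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.33, random_state=42, stratify=labels)

# Bag-of-words representation fitted on the training documents only.
vec = CountVectorizer()
X_tr = vec.fit_transform(docs_tr)
X_te = vec.transform(docs_te)

# Filter step: score every term on the training set and keep the best k.
k = min(5, X_tr.shape[1])
selector = SelectKBest(score_func=chi2, k=k).fit(X_tr, y_tr)
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)

# Classify in the reduced feature space and report both F1 variants,
# mirroring the evaluation protocol described in the abstract.
clf = MultinomialNB().fit(X_tr_sel, y_tr)
pred = clf.predict(X_te_sel)
print("micro-F1:", f1_score(y_te, pred, average="micro"))
print("macro-F1:", f1_score(y_te, pred, average="macro"))

In the paper's protocol this loop would be repeated for several feature-set sizes per dataset, with MFS replacing the chi-square scorer.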
Pages: 139316-139335
Number of pages: 20