On Two-Stage Feature Selection Methods for Text Classification

Times cited: 34
Authors
Uysal, Alper Kursat [1]
Affiliations
[1] Anadolu Univ, Dept Comp Engn, TR-26555 Eskisehir, Turkey
Keywords
Feature selection; genetic algorithms; LSI; PCA; text classification; IDENTIFICATION; ALGORITHM
DOI
10.1109/ACCESS.2018.2863547
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text classification is a high-dimensional pattern recognition problem in which feature selection is an important step. Although researchers continue to propose new feature selection methods, many two-stage methods already exist that combine filter-based feature selection with feature transformation or wrapper-based feature selection in different ways. The main focus of this study is to analyze two-stage feature selection methods for text classification extensively and from a different point of view. The methods investigated combine filter-based local feature selection methods with feature transformation and wrapper-based feature selection methods. In the first stage, four different filter-based local feature selection methods and three different feature set construction methods were employed. Feature sets were constructed by using the maximum globalization policy (MAX), by using the weighted averaging globalization policy (AVG), or by selecting an equal number of features for each class (EQ). In the second stage, principal component analysis (PCA), latent semantic indexing (LSI), or genetic algorithms were utilized. The settings were evaluated with a linear support vector machine classifier on two benchmark data sets, Reuters and Ohsumed, using Micro-F1 and Macro-F1 scores. According to the findings, the AVG and EQ feature set construction methods are usually more successful than the MAX method for two-stage feature selection. Most of the highest accuracies were obtained by employing PCA feature transformation in the second stage. However, there is a strong linear correlation between PCA and LSI for all settings, and the degree of correlation is slightly higher for the Ohsumed data set than for the Reuters data set.
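The three feature set construction methods named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the score table, the class priors used as AVG weights, and the rule that EQ skips a feature already chosen for an earlier class are all assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical local scores: scores[feature][class_index], as a filter-based
# local feature selection method might produce. The numbers are made up.
Scores = Dict[str, List[float]]


def select_max(scores: Scores, k: int) -> List[str]:
    """MAX policy: rank each feature by its highest class-specific score."""
    ranked = sorted(scores, key=lambda f: max(scores[f]), reverse=True)
    return ranked[:k]


def select_avg(scores: Scores, priors: List[float], k: int) -> List[str]:
    """AVG policy: rank each feature by its weighted mean score.

    Assumption: the weights are class prior probabilities.
    """
    def weighted(f: str) -> float:
        return sum(p * s for p, s in zip(priors, scores[f]))

    ranked = sorted(scores, key=weighted, reverse=True)
    return ranked[:k]


def select_eq(scores: Scores, n_classes: int, k: int) -> List[str]:
    """EQ policy: take the same number of top features from every class."""
    per_class = k // n_classes
    selected: List[str] = []
    for c in range(n_classes):
        ranked = sorted(scores, key=lambda f: scores[f][c], reverse=True)
        taken = 0
        for f in ranked:
            if taken == per_class:
                break
            if f not in selected:  # assumption: skip features already chosen
                selected.append(f)
                taken += 1
    return selected


# Toy example with two classes (index 0 and index 1).
scores: Scores = {
    "oil": [0.9, 0.1],
    "crude": [0.8, 0.0],
    "wheat": [0.1, 0.5],
    "grain": [0.0, 0.4],
    "said": [0.3, 0.3],
}
priors = [0.2, 0.8]  # hypothetical class priors

print(select_max(scores, 2))          # ['oil', 'crude']
print(select_avg(scores, priors, 2))  # ['wheat', 'grain']
print(select_eq(scores, 2, 2))        # ['oil', 'wheat']
```

In this toy case MAX keeps only features tied to the class with the largest raw scores, AVG follows the more frequent class, and EQ keeps one feature per class. The selected set would then be passed to the second stage (PCA, LSI, or a genetic algorithm).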
Pages: 43233-43251
Page count: 19