The effects of globalisation techniques on feature selection for text classification

Cited by: 19
Authors
Parlak, Bekir [1]
Uysal, Alper Kursat [1]
Affiliations
[1] Eskisehir Tech Univ, Fac Engn, Dept Comp Engn, TR-26470 Eskisehir, Turkey
Keywords
Feature selection; globalisation techniques; text classification; SCHEME;
DOI
10.1177/0165551520930897
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Text classification (TC) is a very important and critical task in the 21st century, as there is a high volume of electronic data on the Internet. In TC, textual data are characterised by a huge number of highly sparse features/terms. A typical TC pipeline consists of many steps, and one of the most important is undoubtedly feature selection (FS). In this study, we have comprehensively investigated the effects of various globalisation techniques on local feature selection (LFS) methods using datasets with different characteristics: multi-class unbalanced (MCU), multi-class balanced (MCB), binary-class unbalanced (BCU) and binary-class balanced (BCB). The globalisation techniques used in this study are summation (SUM), weighted-sum (AVG) and maximum (MAX). To investigate the effect of globalisation techniques, we used three LFS methods: Discriminative Feature Selection (DFSS), odds ratio (OR) and chi-square (CHI2). In the experiments, we have utilised four benchmark datasets, namely Reuters-21578, 20Newsgroup, Enron1 and Polarity, together with Support Vector Machines (SVM) and Decision Tree (DT) classifiers. According to the experimental results, the most successful globalisation technique is AVG when all situations are taken into account. The experimental results indicate that the DFSS method is more successful than the OR and CHI2 methods on datasets with MCU and MCB characteristics. However, the CHI2 method seems more accurate than the OR and DFSS methods on datasets with BCU and BCB characteristics. Also, the SVM classifier performed better than the DT classifier in most cases.
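The abstract names the three globalisation techniques but does not give their formulas. The following is a minimal sketch, assuming the definitions commonly used in the feature selection literature: SUM adds a term's per-class (local) scores, AVG weights each local score by the class prior before adding, and MAX keeps the largest local score. The function names `globalise` and `top_k_terms`, the toy score matrix and the class priors are all hypothetical and are not taken from the paper.

```python
import numpy as np


def globalise(local_scores, class_priors, technique):
    """Collapse per-class (local) term scores into one global score per term.

    local_scores: array of shape (n_terms, n_classes), e.g. CHI2 or OR per class
    class_priors: array of shape (n_classes,), assumed to sum to 1
    technique:    "SUM", "AVG" or "MAX" (definitions assumed, see lead-in)
    """
    local_scores = np.asarray(local_scores, dtype=float)
    class_priors = np.asarray(class_priors, dtype=float)
    if technique == "SUM":
        return local_scores.sum(axis=1)      # plain sum over classes
    if technique == "AVG":
        return local_scores @ class_priors   # prior-weighted sum over classes
    if technique == "MAX":
        return local_scores.max(axis=1)      # best single-class score
    raise ValueError(f"unknown globalisation technique: {technique}")


def top_k_terms(global_scores, k):
    """Return the indices of the k highest-scoring terms, best first."""
    order = np.argsort(-np.asarray(global_scores), kind="stable")
    return order[:k].tolist()


# Hypothetical local scores for 3 terms over 3 classes.
local = np.array([
    [0.9, 0.1, 0.1],   # term 0: strong for the majority class only
    [0.4, 0.4, 0.4],   # term 1: moderate for every class
    [0.0, 0.0, 1.0],   # term 2: strong for one minority class only
])
priors = np.array([0.5, 0.25, 0.25])  # hypothetical unbalanced class priors

for technique in ("SUM", "AVG", "MAX"):
    scores = globalise(local, priors, technique)
    print(technique, top_k_terms(scores, 2))
```

On this toy input the three techniques select different top-2 term sets, which illustrates why the choice of globalisation technique can matter on unbalanced data.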
Pages: 727-739
Number of pages: 13