A new feature selection metric for text classification: eliminating the need for a separate pruning stage

Cited by: 9
Authors
Asim, Muhammad [1]
Javed, Kashif [2]
Rehman, Abdur [3]
Babri, Haroon A. [2]
Affiliations
[1] Riphah Int Univ, Dept Elect Engn, Lahore, Pakistan
[2] Univ Engn & Technol, Dept Elect Engn, Lahore, Pakistan
[3] Univ Gujrat, Dept Comp Sci, Gujrat, Pakistan
Keywords
Text classification; Feature selection; Feature ranking metrics; Pruning
KeyWords Plus
INFORMATION; PERFORMANCE; ALGORITHM; CRITERIA; IMPACT
DOI
10.1007/s13042-021-01324-6
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Terms that occur too frequently or too rarely across documents are not useful for text classification. Pruning can remove such irrelevant terms, reducing the dimensionality of the feature space and thus making feature selection more efficient and effective. Normally, pruning is achieved by manually setting threshold values. However, incorrect threshold values can result in the loss of many useful terms or the retention of irrelevant ones. Existing feature ranking metrics can assign high ranks to these irrelevant terms, thus degrading the performance of a text classifier. In this paper, we propose a new feature ranking metric that can select the most useful terms even in the presence of too frequently and too rarely occurring terms, thus eliminating the need for a separate pruning stage. To investigate the usefulness of the proposed metric, we compare it against seven well-known feature selection metrics on five data sets, namely Reuters-21578 (re0, re1, r8) and WebACE (k1a, k1b), using multinomial naive Bayes and support vector machine classifiers. Our results, based on a paired t-test, show that the performance of our metric is statistically significantly better than that of the other seven metrics.
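The abstract contrasts the proposed metric with the conventional two-stage pipeline, in which manually chosen document-frequency thresholds prune overly frequent and overly rare terms before a ranking metric scores the survivors. The sketch below illustrates only that baseline pipeline, not the paper's proposed metric; it assumes scikit-learn, uses the 20 newsgroups corpus in place of the Reuters-21578/WebACE data sets, and uses chi-square as a stand-in ranking metric, since the paper's own metric is not given here.

# Minimal sketch (assumptions noted above) of threshold-based pruning
# followed by feature ranking, the pipeline the abstract argues against.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
test = fetch_20newsgroups(subset="test", categories=["sci.med", "sci.space"])

pipeline = make_pipeline(
    # Manual pruning thresholds: drop terms appearing in fewer than 2
    # documents (too rare) or in more than 50% of documents (too frequent).
    # Badly chosen values discard useful terms or keep irrelevant ones,
    # which is the failure mode the abstract describes.
    CountVectorizer(min_df=2, max_df=0.5),
    # Rank the surviving terms (chi-square here) and keep the top 1000.
    SelectKBest(chi2, k=1000),
    MultinomialNB(),
)
pipeline.fit(train.data, train.target)
print("accuracy:", pipeline.score(test.data, test.target))

The paper's comparison of metrics via a paired t-test could be reproduced over such per-data-set accuracy scores with scipy.stats.ttest_rel, though the exact experimental protocol is not specified in this record.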
Pages: 2461-2478
Page count: 18