Feature selection via maximizing global information gain for text classification

Cited by: 100
Authors
Shang, Changxing [1 ,2 ,3 ]
Li, Min [1 ,2 ]
Feng, Shengzhong [1 ]
Jiang, Qingshan [1 ]
Fan, Jianping [1 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Peoples R China
[2] Chinese Acad Sci, Grad Sch, Beijing 100080, Peoples R China
[3] Zhengzhou Inst Informat Sci & Technol, Zhengzhou 450001, Peoples R China
Funding
US National Science Foundation;
Keywords
Feature selection; Text classification; High dimensionality; Distributional clustering; Information bottleneck;
DOI
10.1016/j.knosys.2013.09.019
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature selection is a vital preprocessing step for text classification, used to mitigate the curse of dimensionality. Most existing metrics (such as information gain) evaluate features individually and completely ignore the redundancy between them, which can decrease overall discriminative power because one feature's predictive power is weakened by others. On the other hand, although higher-order algorithms (such as mRMR) do take redundancy into account, their high computational complexity makes them impractical in the text domain. This paper proposes a novel metric called global information gain (GIG), which avoids redundancy naturally, together with an efficient feature selection method called maximizing global information gain (MGIG). We compare MGIG with four other algorithms on six datasets; the experimental results show that MGIG outperforms the other methods in most cases. Moreover, MGIG runs significantly faster than traditional higher-order algorithms, which makes it a suitable choice for feature selection in the text domain. (C) 2013 Elsevier B.V. All rights reserved.
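For background on the baseline criterion the abstract contrasts against, below is a minimal sketch of the classic per-feature information gain score for text features. It is not the paper's GIG/MGIG method, whose definition appears only in the full text; the binary term-document layout, the labels `y`, and all function names are illustrative assumptions.

```python
# Sketch of the classic information gain (IG) feature score that the abstract
# criticizes for evaluating each term independently. Not the paper's GIG metric.
import numpy as np


def entropy(p):
    """Shannon entropy (base 2) of a probability vector; zero entries are skipped."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def information_gain(X, y):
    """Score each term t by IG(t) = H(C) - sum over v in {0,1} of P(t=v) * H(C | t=v).

    X: (n_docs, n_terms) binary term-presence matrix; y: (n_docs,) class labels.
    Each feature is scored in isolation, so a group of redundant terms all score
    high together -- exactly the weakness that motivates GIG/MGIG in the paper.
    """
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    h_c = entropy(priors)  # prior class entropy H(C)
    scores = np.zeros(X.shape[1])
    for t in range(X.shape[1]):
        ig = h_c
        # Condition on the term being present (t=1) and absent (t=0).
        for mask in (X[:, t] > 0, X[:, t] == 0):
            p_v = mask.mean()
            if p_v > 0:
                cond = np.array([np.mean(y[mask] == c) for c in classes])
                ig -= p_v * entropy(cond)
        scores[t] = ig
    return scores


# Usage: keep the k terms with the highest scores (synthetic data for illustration).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = (rng.random((200, 50)) > 0.7).astype(int)
    y = rng.integers(0, 2, size=200)
    top_k = np.argsort(information_gain(X, y))[::-1][:10]
    print("selected term indices:", top_k)
```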
Pages: 298-309
Page count: 12