t-Test feature selection approach based on term frequency for text categorization

Cited by: 92
Authors
Wang, Deqing [1 ,3 ]
Zhang, Hui [2 ,3 ]
Liu, Rui [2 ,3 ]
Lv, Weifeng [2 ,3 ]
Wang, Datao [4 ]
Affiliations
[1] Beihang Univ, Sch Econ & Management, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[3] Beihang Univ, Natl Engn Res Ctr, S&T Resources Sharing Serv, Beijing 100191, Peoples R China
[4] Natl Audit Off, Jinan Resident Off China, Jinan, Peoples R China
Keywords
Feature selection; Term frequency; Student's t-test; Text classification
DOI
10.1016/j.patrec.2014.02.013
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection techniques play an important role in text categorization (TC), especially in large-scale TC tasks. Many new and improved methods have been proposed, and most of them are based on document frequency, such as the well-known Chi-square statistic and information gain. These document-frequency-based methods, however, have two shortcomings: (1) they are unreliable for low-frequency terms, which are filtered out because of their small weights; and (2) they count only whether a term occurs in a document and ignore how often it occurs. In real-life corpora, however, a high-frequency term (excluding stop words) that occurs in only a few documents is often a good discriminator. To address these drawbacks, this paper focuses on constructing a feature selection function based on term frequency and proposes a new approach using Student's t-test. The t-test function measures the difference between the distributions of a term's frequency within a specific category and over the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the proposed approach is comparable to state-of-the-art feature selection methods in terms of macro-F1 and micro-F1. On micro-F1 in particular, our method achieves slightly better performance than χ² and IG on Reuters with kNN and SVM classifiers. (C) 2014 Elsevier B.V. All rights reserved.
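The abstract describes scoring each candidate term by a t-test that compares the term's per-document frequency within one category against its frequency over the entire corpus. A minimal sketch of that idea, using a generic Welch-style two-sample t statistic on hypothetical frequency data (the paper's exact TF-based formulation is not given in this record, so treat this only as an illustration of the scoring principle):

```python
import math

def t_score(freqs_in_category, freqs_in_corpus):
    """Welch-style two-sample t statistic comparing a term's
    per-document frequency in one category vs. the whole corpus.
    Higher scores suggest the term discriminates the category.
    NOTE: a sketch of the general idea, not the paper's formula."""
    n1, n2 = len(freqs_in_category), len(freqs_in_corpus)
    m1 = sum(freqs_in_category) / n1
    m2 = sum(freqs_in_corpus) / n2
    # Unbiased sample variances (denominator n - 1).
    v1 = sum((x - m1) ** 2 for x in freqs_in_category) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in freqs_in_corpus) / (n2 - 1)
    return abs(m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical per-document counts of one term:
category = [5, 7, 6, 8]               # documents in the target class
corpus = [1, 0, 2, 5, 7, 6, 8, 1]     # all documents in the corpus
print(round(t_score(category, corpus), 3))
```

For feature selection, every term in the vocabulary would be scored this way and the top-k terms kept; a term frequent in one category but rare elsewhere gets a large score, which is exactly the low-document-frequency, high-term-frequency case the abstract argues document-frequency methods mishandle.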
Pages: 1-10
References (40 items)
[1] Anonymous, 1908, Biometrika, 6:1.
[2] Anonymous, Proceedings of the 14th International Conference on Machine Learning (ICML).
[3] Billingsley, P., 1995, Probability and Measure, p. 357.
[4] Blei, D.M., Ng, A.Y., Jordan, M.I., 2003, Latent Dirichlet allocation, Journal of Machine Learning Research, 3(4-5):993-1022.
[5] Chang, C.-C., Lin, C.-J., 2011, LIBSVM: A Library for Support Vector Machines, ACM Transactions on Intelligent Systems and Technology, 2(3).
[6] Cortes, C., Vapnik, V., 1995, Support-vector networks, Machine Learning, 20(3):273-297.
[7] Wang, D., Zhang, H., Liu, R., Lin, M., Wu, W., 2012, Predicting bugs' components via mining bug reports, Journal of Software, 7(5):1149-1154.
[8] Devore, J., 1997, Statistics: The Exploration and Analysis of Data, 3rd ed.
[9] Dunning, T., 1993, Computational Linguistics, 19:61.
[10] Han, E.-H., 2000, Proceedings of PKDD.