Comparison of text feature selection policies and using an adaptive framework

Cited by: 47
Authors
Tasci, Serafettin [1]
Gungor, Tunga [1]
Affiliations
[1] Bogazici Univ, Dept Comp Engn, TR-34342 Istanbul, Turkey
Keywords
Document categorization; Feature selection; Local and global policies; Adaptive keyword selection; Support vector machines
DOI
10.1016/j.eswa.2013.02.019
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text categorization is the task of automatically assigning unlabeled text documents to predefined category labels by means of an induction algorithm. Because the data in text categorization are high-dimensional, feature selection is often used to reduce the dimensionality. In this paper, we evaluate and compare the feature selection policies used in text categorization by employing some of the popular feature selection metrics. For the experiments, we use datasets that vary in size, complexity, and skewness. We use a support vector machine as the classifier and tf-idf for weighting the terms. In addition to evaluating the policies, we propose new feature selection metrics that show high success rates, especially with a low number of keywords. These metrics are two-sided local metrics and are based on the difference between the distributions of a term in the documents belonging to a class and in the documents not belonging to that class. Moreover, we propose a keyword selection framework called adaptive keyword selection. It is based on selecting a different number of terms for each class, and it shows significant improvement on skewed datasets that have a limited number of training instances for some of the classes. © 2013 Elsevier Ltd. All rights reserved.
Pages: 4871-4886
Number of pages: 16