Categorical Term Frequency Probability Based Feature Selection for Document Categorization

Cited by: 0
Authors
Li, Qiang [1]
He, Liang [1]
Lin, Xin [1]
Affiliation
[1] East China Normal Univ, Dept Comp Sci & Technol, Shanghai, Peoples R China
Source
2013 INTERNATIONAL CONFERENCE OF SOFT COMPUTING AND PATTERN RECOGNITION (SOCPAR) | 2013
Keywords
term frequency; feature selection; variance mean; document categorization; categorical distribution
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Document categorization relies heavily on the categorical distribution of features. Terms that occur unevenly across categories carry strong discriminative information for categorization. We first define CTFP (Categorical Term Frequency Probability), which reflects the categorical characteristics of a term in each category. We then introduce the CTFP_VM (Variance-Mean based on CTFP) feature selection criterion to reveal differences in category distribution. Computing and ranking the variance mean of the CTFP distribution for each term yields the feature sets used for document categorization. We run document categorization experiments with SVM classifiers on the well-known Reuters-21578 and 20news-18828 corpora, used as the unbalanced and the balanced corpus respectively. The experiments compare the proposed method with conventional feature selection algorithms, and the proposed method achieves the best feature set for document categorization. The results also show that the proposed CTFP-based variance-mean feature selection method not only achieves a better F1 measure for document categorization but also adapts well to different corpora.
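The abstract does not give the formulas for CTFP or CTFP_VM, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that CTFP of a term in a category is the term's relative frequency in that category, normalized so the values over all categories sum to 1, and it uses the plain population variance of that distribution as a stand-in for the variance-mean score. The function names `ctfp_table` and `rank_terms` are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import pvariance


def ctfp_table(docs, labels):
    """Return (categories, {term: [probability per category]}).

    ASSUMPTION: CTFP(t, c) is the relative frequency of t in c,
    normalized over all categories. The abstract does not define it.
    """
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    totals = {c: sum(counts[c].values()) for c in counts}
    # Skip categories without any tokens to avoid division by zero.
    cats = sorted(c for c in counts if totals[c] > 0)
    vocab = set().union(*(counts[c] for c in cats))
    table = {}
    for term in vocab:
        rel = [counts[c][term] / totals[c] for c in cats]
        norm = sum(rel)
        table[term] = [r / norm for r in rel]
    return cats, table


def rank_terms(table):
    """Rank terms by the variance of their per-category distribution.

    ASSUMPTION: population variance stands in for the CTFP_VM score.
    """
    scores = {term: pvariance(probs) for term, probs in table.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = [
    ["wheat", "price", "the"],
    ["wheat", "crop", "the"],
    ["goal", "match", "the"],
    ["goal", "team", "the"],
]
labels = ["grain", "grain", "sport", "sport"]
cats, table = ctfp_table(docs, labels)
ranked = rank_terms(table)
```

In this toy corpus, a term such as "the" that is spread evenly over both categories gets a score of zero and ranks last, while category-specific terms such as "wheat" and "goal" rank at the top. A feature set would then be the first k entries of `ranked`, with k chosen by experiment.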
Pages: 60-65
Page count: 6