Categorical Term Frequency Probability Based Feature Selection for Document Categorization
Cited by: 0
Authors:
Li, Qiang [1]
He, Liang [1]
Lin, Xin [1]
Affiliations:
[1] East China Normal Univ, Dept Comp Sci & Technol, Shanghai, Peoples R China
Source:
2013 International Conference of Soft Computing and Pattern Recognition (SoCPaR), 2013
Keywords:
term frequency;
feature selection;
variance mean;
document categorization;
categorical distribution;
DOI:
Not available
Chinese Library Classification (CLC) number:
TP18 [Artificial intelligence theory];
Discipline classification codes:
081104 ;
0812 ;
0835 ;
1405 ;
Abstract:
Document categorization relies heavily on the categorical distribution of features. Terms that occur unevenly across categories carry strong discriminative information for categorization. We first define CTFP (Categorical Term Frequency Probability), which reflects the characteristics of a term in each category. We then introduce the CTFP_VM (Variance-Mean based on CTFP) feature selection criterion to reveal differences in a term's distribution across categories. After computing and ranking the variance mean of the CTFP distribution for each term, feature sets are obtained for document categorization. We perform document categorization experiments with SVM classifiers on the well-known Reuters-21578 and 20news-18828 corpora, used as the unbalanced and the balanced corpus respectively. The experiments compare the proposed method with conventional feature selection algorithms, and the proposed method achieves the best feature set for document categorization. The experimental results also demonstrate that the proposed variance-mean feature selection method based on CTFP not only achieves a better F1 measure for document categorization but also adapts well to different corpora.
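The abstract does not give the formulas for CTFP or CTFP_VM, so the following is only a rough sketch of the general idea (score each term by how unevenly it is spread across categories, then rank). It assumes, purely for illustration, that CTFP(t, c) is the frequency of term t in category c divided by its frequency over all categories, and it uses the plain population variance of those values as a stand-in score. The function names `ctfp_table` and `rank_terms` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict
from statistics import pvariance


def ctfp_table(docs):
    """Build {term: {category: probability}} from (category, tokens) pairs.

    ASSUMPTION: CTFP(t, c) = tf(t, c) / sum over all categories of tf(t, c').
    The paper's actual definition is not stated in the abstract.
    """
    tf = defaultdict(Counter)  # term -> Counter(category -> frequency)
    categories = set()
    for category, tokens in docs:
        categories.add(category)
        for token in tokens:
            tf[token][category] += 1

    table = {}
    for term, per_category in tf.items():
        total = sum(per_category.values())
        # Categories where the term never occurs get probability 0.
        table[term] = {c: per_category[c] / total for c in categories}
    return table


def rank_terms(docs, k):
    """Return the k terms whose assumed CTFP values vary most across categories.

    ASSUMPTION: population variance is used as a stand-in for CTFP_VM.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    table = ctfp_table(docs)
    scores = {term: pvariance(list(probs.values())) for term, probs in table.items()}
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


if __name__ == "__main__":
    sample = [
        ("sports", ["goal", "match", "the"]),
        ("finance", ["stock", "the", "market"]),
    ]
    print(rank_terms(sample, 3))
```

In this toy input, a term such as "the" that appears equally in both categories gets a score of zero and ranks last, while terms confined to one category rank first, which matches the abstract's claim that unevenly distributed terms are the most discriminative.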
Pages: 60-65
Number of pages: 6