Two novel term weighting for text categorization

Cited by: 0
Authors
Matsunaga, L. A.
Ebecken, N. F. F.
Source
DATA MINING IX: DATA MINING, PROTECTION, DETECTION AND OTHER SECURITY TECHNOLOGIES | 2008 / Vol. 40
Keywords
term weighting; text categorization; text classification;
DOI
10.2495/DATA080111
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In text categorization (TC) based on the vector space model, each document is represented as a vector in which each component is associated with a particular term from the text collection vocabulary. Traditionally, each component value is assigned using the information retrieval (IR) TFIDF measure. While this weighting method seems very appropriate for IR, weighting methods that take into account the importance of the term to the discrimination of the categories may provide better results in TC. To apply this idea, we use in this work variants of TFIDF weighting in which the idf part is replaced by functions used to conduct term selection. In an application to real-world data, the automatic distribution of legislative bills to the committees of the Federal District Legislative Assembly in Brasilia, Brazil, replacing the idf part of TFIDF with a new term selection measure (absl-logit) or with bi-normal separation [1] produced the best overall classification results with support vector machines (SVM), compared both with TFIDF and with variants in which common term selection measures (chi-square, information gain, gain ratio and odds ratio) replace the idf part.
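As a rough illustration of the weighting idea in the abstract, the sketch below replaces the idf factor with a bi-normal separation (BNS) score, using the common definition of BNS as the absolute difference between the inverse standard normal CDF of a term's true positive rate and of its false positive rate. It is not the authors' implementation: the function names, the `eps` clipping value, the raw term-frequency form of tf, and the example counts are all assumptions made for illustration, and both classes are assumed to be non-empty.

```python
from statistics import NormalDist


def bns(tp, fn, fp, tn, eps=0.0005):
    """Bi-normal separation score of one term for one category.

    tp/fn: positive documents that do / do not contain the term.
    fp/tn: negative documents that do / do not contain the term.
    eps is an assumed clipping bound that keeps the rates inside (0, 1),
    where the inverse normal CDF is defined.
    """
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / (tp + fn), eps), 1 - eps)
    fpr = min(max(fp / (fp + tn), eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))


def tf_bns_vector(term_counts, term_stats):
    """Weight each raw term frequency by the term's BNS score instead of idf."""
    return {t: tf * bns(*term_stats[t]) for t, tf in term_counts.items()}


# Hypothetical counts: "tax" separates the classes, "the" does not.
stats = {"tax": (90, 10, 10, 90), "the": (50, 50, 50, 50)}
weights = tf_bns_vector({"tax": 3, "the": 5}, stats)
```

A term that occurs equally often in both classes gets a weight of zero however frequent it is in the document, which is the behavior that separates this kind of supervised weighting from idf.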
Pages: 105-114
Page count: 10
Related papers
13 in total
[1]   On logit confidence intervals for the odds ratio with small samples [J].
Agresti, A .
BIOMETRICS, 1999, 55 (02) :597-602
[2]   AUTOMATED LEARNING OF DECISION RULES FOR TEXT CATEGORIZATION [J].
APTE, C ;
DAMERAU, F ;
WEISS, SM .
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1994, 12 (03) :233-251
[3]   An analysis of the relative hardness of Reuters-21578 subsets [J].
Debole, F ;
Sebastiani, F .
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2005, 56 (06) :584-596
[4]  
Debole F., 2003, PROCEEDING 18 ACM S, P784, DOI 10.1145/952532.952688
[5]  
Forman G., 2003, Journal of Machine Learning Research, V3, P1289, DOI 10.1162/153244303322753670
[6]  
Joachims J., 1999, ADV KERNEL METHODS S
[7]  
Lan M., 2006, ASS ADV ARTIFICIAL I, V6, P763
[8]   Text categorization with support vector machines. How to represent texts in input space? [J].
Leopold, E ;
Kindermann, J .
MACHINE LEARNING, 2002, 46 (1-3) :423-444
[9]  
Lewis DD, 2004, J MACH LEARN RES, V5, P361
[10]  
MATSUNAGA L, 2007, AUTOMATED TEXT CATEG