Two novel term weighting for text categorization

Cited by: 0

Authors:

Matsunaga, L. A.

Ebecken, N. F. F.

Affiliation:

Source:

DATA MINING IX: DATA MINING, PROTECTION, DETECTION AND OTHER SECURITY TECHNOLOGIES | 2008 / Vol. 40

Keywords:

term weighting; text categorization; text classification;

DOI:

10.2495/DATA080111

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In text categorization (TC) based on the vector space model, each document is represented as a vector in which each component is associated with a particular term from the vocabulary of the text collection. Traditionally, each component value is assigned using the TFIDF measure from information retrieval (IR). While this weighting method seems very appropriate for IR, weighting methods that take into account the importance of a term for discriminating between categories may provide better results in TC. To apply this idea, this work uses variants of TFIDF weighting in which the idf part is replaced by functions normally used for term selection. In an application to real-world data, the automatic distribution of legislative bills to committees at the Federal District Legislative Assembly in Brasilia, Brazil, replacing the idf part of TFIDF with a new term selection measure (absl-logit) or with bi-normal separation [1] produced the best overall classification results with support vector machines (SVM). The comparison was against standard TFIDF and against variants in which the idf part is replaced by common term selection measures: chi-square, information gain, gain ratio and odds ratio.
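The idea of replacing the idf factor with a term selection score can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a binary (one category versus the rest) setting, pre-tokenized documents, raw term counts as the tf part, and the commonly used definition of bi-normal separation, |F^-1(tpr) - F^-1(fpr)|, with rates clipped away from 0 and 1. The function names `bns_scores` and `tf_bns_vector` and the clipping value `eps` are hypothetical choices made for this illustration.

```python
from collections import Counter
from statistics import NormalDist


def bns_scores(docs, labels, eps=0.0005):
    """Bi-normal separation per term: |F^-1(tpr) - F^-1(fpr)|.

    docs   : list of token lists
    labels : list of booleans (True = document belongs to the category)
    eps    : assumed clipping bound that keeps the inverse normal CDF finite
    """
    inv = NormalDist().inv_cdf  # inverse CDF of the standard normal
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative document")

    # Document frequencies of each term inside and outside the category.
    tp, fp = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            (tp if y else fp)[term] += 1

    scores = {}
    for term in set(tp) | set(fp):
        tpr = min(max(tp[term] / pos, eps), 1 - eps)
        fpr = min(max(fp[term] / neg, eps), 1 - eps)
        scores[term] = abs(inv(tpr) - inv(fpr))
    return scores


def tf_bns_vector(tokens, scores):
    """Weight each term by raw term frequency times its BNS score."""
    tf = Counter(tokens)
    # Terms unseen during scoring get weight 0 (an assumption of this sketch).
    return {term: count * scores.get(term, 0.0) for term, count in tf.items()}


docs = [["tax", "bill"], ["tax", "reform"], ["health", "bill"], ["health", "care"]]
labels = [True, True, False, False]
scores = bns_scores(docs, labels)
vector = tf_bns_vector(["tax", "tax", "bill"], scores)
```

In this toy collection, "bill" occurs equally often inside and outside the category and so receives zero weight, while "tax" occurs only inside the category and receives a large weight. An idf factor would not make this distinction, since both terms appear in the same number of documents.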

Citation

Pages: 105-114

Page count: 10

13 items in total

[1] On logit confidence intervals for the odds ratio with small samples [J].