Supervised and Traditional Term Weighting Methods for Automatic Text Categorization

Cited by: 352

Authors:

Lan, Man [1,2]

Tan, Chew Lim [2]

Su, Jian [3]

Lu, Yue [1]

Affiliations:

[1] E China Normal Univ, Dept Comp Sci & Technol, Shanghai 200241, Peoples R China

[2] Natl Univ Singapore, Sch Comp, Dept Comp Sci, Singapore 117590, Singapore

[3] Inst Infocomm Res, Singapore 119613, Singapore

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2009, Vol. 31, No. 4

Keywords:

Text categorization; text representation; term weighting; SVM; kNN; relevance

DOI:

10.1109/TPAMI.2008.110

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e., words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with the SVM and kNN algorithms. Taking into account the distribution of relevant documents in the collection, we propose a new, simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed supervised term weighting method, tf.rf, consistently performs better than the other term weighting methods, while most supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popular tf.idf method does not show uniformly good performance across the different data sets.
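As a rough illustration of the idea described in the abstract, the sketch below computes tf.rf weights for one category on a toy corpus. The abstract itself does not give the formula. The form used here, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term, is recalled from the full paper and should be checked against it. The function name `tf_rf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_rf_weights(docs, labels, positive_label):
    """Return (per-document {term: tf * rf} dicts, {term: rf}).

    Assumed formula (verify against the full paper):
        rf = log2(2 + a / max(1, c))
    a = number of positive-category documents containing the term
    c = number of negative-category documents containing the term
    """
    a = Counter()
    c = Counter()
    for tokens, label in zip(docs, labels):
        target = a if label == positive_label else c
        target.update(set(tokens))  # count each term once per document

    vocab = set(a) | set(c)
    rf = {t: math.log2(2 + a[t] / max(1, c[t])) for t in vocab}

    # Raw term frequency times the category-specific rf factor.
    weights = [
        {t: tf * rf[t] for t, tf in Counter(tokens).items()}
        for tokens in docs
    ]
    return weights, rf


# Hypothetical toy corpus with two categories.
docs = [
    ["wheat", "price", "wheat"],
    ["wheat", "export"],
    ["oil", "price"],
    ["oil", "export", "oil"],
]
labels = ["grain", "grain", "crude", "crude"]

weights, rf = tf_rf_weights(docs, labels, positive_label="grain")
print(rf["wheat"], rf["price"], rf["oil"])
print(weights[0])
```

In this toy setup, a term concentrated in the positive category ("wheat") receives a larger rf factor than one spread evenly across both categories ("price"), which is the discriminating effect the abstract attributes to tf.rf. Unlike idf, the factor depends on the category labels, so it has to be recomputed for each target category.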


Pages: 721-735

Page count: 15
