An Evaluation of Statistical Approaches to Text Categorization

Cited by: 609
Author
Yiming Yang
Source
Information Retrieval | 1999 / Vol. 1 / Issue 1-2
Keywords
text categorization; statistical learning algorithms; comparative study; evaluation
DOI
10.1023/A:1009982220290
Abstract
This paper focuses on a comparative evaluation of a wide range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.
Pages: 69-90
Number of pages: 21