Training the genre classifier for automatic classification of web pages

Cited by: 3
Authors
Vidulin, Vedrana [1]
Lustrek, Mitja [1]
Gams, Matjaz [1]
Affiliations
[1] Jozef Stefan Inst, Jamova 39, SI-1000 Ljubljana, Slovenia
Source
PROCEEDINGS OF THE ITI 2007 29TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY INTERFACES | 2007
Keywords
genre classification; web page; genre features; ensemble algorithm
DOI
10.1109/ITI.2007.4283750
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1539 manually labeled web pages was prepared. Secondly, 502 genre features were selected based on the literature and the observation of the corpus. Thirdly, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. The ensemble algorithm achieved on average 17% better precision and 1.6% better accuracy, but slightly worse recall; F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
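The abstract reports that bagging improved precision and accuracy, lowered recall slightly, and left F-measure roughly unchanged. The following is a minimal sketch of how those four measures relate, assuming each genre is scored as a separate binary task. The function name `genre_metrics` and all confusion counts are hypothetical illustrations, not values or code from the paper.

```python
def genre_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f_measure, accuracy) for one genre,
    treated as a binary task (page belongs to the genre or not)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy


# Hypothetical confusion counts for one genre over 1000 pages
# (illustrative only; NOT results from the paper).
single_tree = genre_metrics(tp=60, fp=40, fn=40, tn=860)
bagged = genre_metrics(tp=50, fp=15, fn=50, tn=885)

for name, scores in (("single tree", single_tree), ("bagged", bagged)):
    p, r, f, a = scores
    print(f"{name}: precision={p:.3f} recall={r:.3f} "
          f"F={f:.3f} accuracy={a:.3f}")
```

In this made-up case the bagged classifier makes fewer false-positive predictions, so precision and accuracy rise while recall falls. Because F-measure is the harmonic mean of precision and recall, the two shifts largely offset each other.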
Pages: 93 / +
Page count: 2