Web Genre Classification via Hierarchical Multi-label Classification

Cited by: 4
Authors
Madjarov, Gjorgji [1]
Vidulin, Vedrana [2]
Dimitrovski, Ivica [1]
Kocev, Dragi [3, 4]
Affiliations
[1] Ss Cyril & Methodius Univ, Fac Comp Sci & Engn, Skopje, Macedonia
[2] Rudjer Boskovic Inst, Zagreb, Croatia
[3] Univ Bari Aldo Moro, Dept Informat, Bari, Italy
[4] Jozef Stefan Inst, Dept Knowledge Technol, Ljubljana, Slovenia
Source
INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2015 | 2015 / Vol. 9375
Keywords
Web genre classification; Hierarchy construction; Hierarchical multi-label classification
DOI
10.1007/978-3-319-24834-9_2
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The growing number of web pages calls for improvements to search engines. One such improvement is to let users specify the desired web genre of the result pages. This creates a need to predict a page's genre from the information on the page. Typically, this task is addressed as multi-class classification, with some recent studies advocating multi-label classification. In this paper, we propose to exploit the web genre labels by constructing a hierarchy of web genres and then using methods for hierarchical multi-label classification to boost predictive performance. We use two methods for hierarchy construction: expert-based and data-driven. The evaluation on a benchmark dataset (20-Genre collection corpus) reveals that using a hierarchy of web genres significantly improves the predictive performance of the classifiers, and that the data-driven hierarchy yields performance similar to that of the expert-driven one, with the added value that it was obtained automatically and quickly.
Pages: 9-17
Number of pages: 9