A Multi-label and Adaptive Genre Classification of Web Pages

Cited by: 5
Authors
Jebari, Chaker [1]
Wani, M. Arif [2]
Affiliations
[1] Fac Sci, Comp Sci Dept, Tunis, Tunisia
[2] Calif State Univ Bakersfield, Comp & Elect Engn & Comp Sci Dept, Bakersfield, CA USA
Source
2012 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2012), VOL 1 | 2012
Keywords
Multi-label; classification; genre; centroid; adaptive;
DOI
10.1109/ICMLA.2012.106
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
This paper proposes a new centroid-based approach for classifying web pages by genre, using character n-grams extracted from different information sources such as the URL, title, headings, and anchors. To cope with the complexity of web pages and the rapid evolution of web genres, the approach implements a multi-label, adaptive classification scheme in which web pages are classified one at a time and each page can be assigned to more than one genre. Depending on the similarity between a new page and each genre centroid, the approach either adapts the genre centroid under consideration or treats the new page as a noise page and discards it. Experimental results show that the approach is very fast and achieves better results than existing multi-label classifiers.
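The scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record gives no formulas, so the cosine similarity measure, the fixed threshold, the trigram size, and the update rule (adding the page's n-gram counts to the centroid) are all assumptions, and the names `char_ngrams`, `cosine`, and `AdaptiveCentroidClassifier` are hypothetical. The sketch also assumes that the text from the URL, title, headings, and anchors has already been concatenated into a single string.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a string (n=3 is an assumption)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AdaptiveCentroidClassifier:
    """Hypothetical multi-label classifier with one centroid per genre."""

    def __init__(self, threshold=0.3, n=3):
        self.threshold = threshold  # assumed similarity cut-off
        self.n = n
        self.centroids = {}  # genre -> summed n-gram counts

    def seed(self, genre, texts):
        """Build an initial centroid from a few labelled pages."""
        centroid = Counter()
        for text in texts:
            centroid.update(char_ngrams(text, self.n))
        self.centroids[genre] = centroid

    def classify(self, text):
        """Label one page; adapt every centroid it is similar enough to.

        An empty result means the page was treated as noise and discarded.
        """
        vec = char_ngrams(text, self.n)
        labels = []
        for genre, centroid in self.centroids.items():
            if cosine(vec, centroid) >= self.threshold:
                centroid.update(vec)  # assumed adaptation rule
                labels.append(genre)
        return labels
```

Because cosine similarity ignores vector length, the summed counts serve as a centroid without explicit averaging. A page similar to several centroids receives several labels, which is the multi-label behaviour the abstract describes.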
Pages: 578-581
Page count: 4