Discovering Informative Contents of Web Pages

Times cited: 0
Authors
Fan, Qifeng [1]
Yan, Chunwei [1]
Huang, Lifu [1]
Huang, Lian'en [1]
Affiliations
[1] Peking Univ, Shenzhen Key Lab Cloud Comp Technol & Applicat, Shenzhen Grad Sch, Shenzhen, Guangdong, Peoples R China
Source
WEB-AGE INFORMATION MANAGEMENT, WAIM 2014 | 2014 / Vol. 8485
Keywords
Template Detection; Information Extraction; Entropy
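DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract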
The World Wide Web has become a huge information repository. However, besides informative content, Web pages also contain redundant content, which is considered harmful to Web mining and search systems. In this paper, we propose a new approach to discovering informative content in a set of Web pages from a single Web site. The method works as follows. First, we propose a newly designed Site Style Tree (SST) that captures the common presentation styles and the actual contents of the pages in the given Web site. The tree structure, which differs from the one previously proposed, is built by aligning the pages of the site. Then, for each node of the SST, informative content is discovered using entropy and a threshold method. The approach is evaluated on two mining tasks, Web page clustering and classification. The experimental results show a significant improvement over previous template detection approaches.
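The entropy-and-threshold step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formula, so everything below is an assumption made for illustration: the names `normalized_entropy` and `informative_nodes`, the normalization by the logarithm of the number of pages, the default threshold of 0.5, and the representation of each aligned node as a path string mapped to the text found at that position on each page. The idea is only that content repeated verbatim across pages (template material) has low entropy, while content that varies from page to page has high entropy.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the value distribution, scaled to [0, 1].

    `values` holds the content seen at one aligned node position,
    one entry per page (hypothetical representation).
    """
    n = len(values)
    if n <= 1:
        return 0.0
    counts = Counter(values)
    # Sum of p * log(1/p) over the distinct contents.
    h = sum((c / n) * math.log(n / c) for c in counts.values())
    # Divide by the maximum possible entropy for n pages.
    return h / math.log(n)


def informative_nodes(node_contents, threshold=0.5):
    """Return the node paths whose content varies enough across pages.

    The threshold value is an illustrative assumption, not the paper's.
    """
    return {
        path
        for path, values in node_contents.items()
        if normalized_entropy(values) >= threshold
    }


if __name__ == "__main__":
    pages = {
        "body/div[1]": ["Home | About"] * 4,           # same on every page
        "body/div[2]": ["news A", "news B", "news C", "news D"],
    }
    print(informative_nodes(pages))
```

In this sketch a navigation bar that is identical on every page scores 0 and is filtered out, while a block whose text differs on every page scores close to 1 and is kept.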
Pages: 180-191
Number of pages: 12