Schema Inference and Data Extraction from Templatized Web Pages

Citations: 0
Authors
Krishna, Shinde Santaji [1]
Dattatraya, Joshi Shashank [2]
Affiliations
[1] Shri Jagdish Prasad Jhabarmal Tibrewala Univ, Dept Comp Engn, Jhunjhunu, Rajasthan, India
[2] Bharati Vidyapeeth Deemed Univ, Coll Engn, Dept Comp Engn, Pune, Maharashtra, India
Source
2015 INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING (ICPC) | 2015
Keywords
Data Extraction; Multiple Tree Merging; Schema; Vision-based Page Segmentation; Web page;
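DOI
Not available
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline classification codes
0808 ; 0809 ;
Abstract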
The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many websites consist largely of pages that contain structured data, so extracting information from their Web documents is an important step in Web information integration. This paper presents an unsupervised approach to page-level data extraction that automatically detects the schema of web pages. Web pages are compared on the basis of visual cues to identify fixed-template and variant-template pages. Data regions are then extracted from the pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
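The fixed-template idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes two well-formed pages with identical tag structure, skips the visual comparison, tree alignment, and mining steps entirely, and uses hypothetical names (`same_template`, `merge`, `extract`, `DATA_SLOT`) chosen only for illustration. Text that is identical across pages is treated as template text, and text that differs is treated as a data field.

```python
import xml.etree.ElementTree as ET

DATA_SLOT = "#DATA"  # hypothetical marker for a variant (data) field


def same_template(a, b):
    """Return True if two elements share the same tag structure (text ignored)."""
    if a.tag != b.tag or len(a) != len(b):
        return False
    return all(same_template(x, y) for x, y in zip(a, b))


def merge(a, b):
    """Merge two structurally identical trees into one schema tree.

    Assumption: only element text is compared; tail text and attributes
    are ignored to keep the sketch short.
    """
    text_a = (a.text or "").strip()
    text_b = (b.text or "").strip()
    node = ET.Element(a.tag)
    # Identical text is kept as template text; differing text becomes a slot.
    node.text = text_a if text_a == text_b else DATA_SLOT
    for x, y in zip(a, b):
        node.append(merge(x, y))
    return node


def extract(schema, page):
    """Collect the page's text at every position the schema marks as a slot."""
    values = [(page.text or "").strip()] if schema.text == DATA_SLOT else []
    for s, p in zip(schema, page):
        values.extend(extract(s, p))
    return values


# Two toy pages generated from the same (hypothetical) template.
page1 = ET.fromstring("<div><h1>Book</h1><span>Title:</span><b>Dune</b></div>")
page2 = ET.fromstring("<div><h1>Book</h1><span>Title:</span><b>Emma</b></div>")

if same_template(page1, page2):
    schema = merge(page1, page2)
    print(ET.tostring(schema, encoding="unicode"))
    print(extract(schema, page1), extract(schema, page2))
```

Pages whose structure differs fail the `same_template` check here; the abstract handles that case with a variant tree matching algorithm, which this sketch does not attempt.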
Pages: 6