Extracting web data using instance-based learning

Cited by: 0
Authors
Zhai, YH [1]
Liu, B [1]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Chicago, IL 60607 USA
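Source
WEB INFORMATION SYSTEMS ENGINEERING - WISE 2005 | 2005 / Vol. 3806
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract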
This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance (or page) to be extracted with labeled instances (or pages). The key advantage of our method is that it does not need an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance (or page). Only when a new page cannot be extracted does the page need labeling. This avoids unnecessary page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled pages may not be representative of all other pages. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates, and pages of the same template can usually be extracted using a single page instance of the template. The key issue is the similarity or distance measure. Traditional measures based on the Euclidean distance or text similarity are not easily applicable in this context because items to be extracted from different pages can be entirely different. This paper proposes a novel similarity measure for the purpose, which is suitable for templated Web pages. Experimental results with product data extraction from 1200 pages in 24 diverse Web sites show that the approach is surprisingly effective. It significantly outperforms existing state-of-the-art systems.
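The extraction loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the paper's similarity measure, so this sketch substitutes a deliberately simple stand-in: an item is located by exact match of the few tokens before and after it in a labeled page. All names here (`tokenize`, `make_instance`, `extract`, `extract_all`, and the `label_page` callback) are hypothetical and are not taken from the paper.

```python
import re

# Assumption: a page is tokenized into HTML tags and whitespace-separated words.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split a page into a flat list of tag and word tokens."""
    return TOKEN.findall(html)


def make_instance(tokens, start, end, k=2):
    """Store the k tokens before and after the labeled item tokens[start:end]."""
    return {"prefix": tokens[max(0, start - k):start],
            "suffix": tokens[end:end + k]}


def extract(tokens, instance):
    """Return the item whose context matches the instance.

    Stand-in for the paper's similarity measure: exact prefix/suffix match.
    Returns None when there is no match or the match is ambiguous.
    """
    prefix, suffix = instance["prefix"], instance["suffix"]
    matches = []
    for i in range(len(tokens) - len(prefix) + 1):
        if tokens[i:i + len(prefix)] != prefix:
            continue
        start = i + len(prefix)
        # Take the nearest position after the prefix where the suffix begins.
        for end in range(start + 1, len(tokens) - len(suffix) + 1):
            if tokens[end:end + len(suffix)] == suffix:
                matches.append(tokens[start:end])
                break
    return matches[0] if len(matches) == 1 else None


def extract_all(pages, instances, label_page):
    """Extract one item per page, asking for a label only on failure.

    label_page(tokens) is a hypothetical callback standing in for a human
    labeler; it returns the (start, end) token span of the item.
    """
    results = []
    for page in pages:
        tokens = tokenize(page)
        item = None
        for instance in instances:
            item = extract(tokens, instance)
            if item is not None:
                break
        if item is None:
            # No stored instance fits this page: label it and remember it.
            start, end = label_page(tokens)
            instances.append(make_instance(tokens, start, end))
            item = tokens[start:end]
        results.append(" ".join(item))
    return results
```

Starting from one labeled page, `extract_all` handles every page that shares its template and requests a new label only when it meets a page that no stored instance can extract, which is the labeling behavior the abstract describes.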
Pages: 318-331
Page count: 14