Research of Extracting Data from HTML Web Pages Automatically

Cited by: 0
Authors
Wang Ru (王茹)
Song Hantao (宋瀚涛)
Lu Yuchang (陆玉昌)
Affiliations
[1] Beijing 100081
[2] Beijing 100084
[3] Beijing Institute of Technology
[4] China
[5] Department of Computer Science and Engineering
[6] School of Information Science and Technology
[7] State Key Laboratory of Intelligent Technology and System
[8] Tsinghua University
Keywords
information extraction; data transformation; wrapper; HTML page
DOI
10.15918/j.jbit1004-0579.2003.s1.023
Chinese Library Classification number
TP393.092
Discipline classification code
080402
Abstract
To use data available on the Internet, it is necessary to extract data from web pages. An HTT tree model representing HTML pages is presented. Based on the HTT model, a wrapper generation algorithm, AGW, is proposed. The AGW algorithm uses a comparing and correcting technique to generate the wrapper from the native characteristics of the HTT tree structure. The AGW algorithm can not only generate the wrapper automatically, but also rebuild the data schema easily and reduce the computational complexity.
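The abstract does not define the HTT model or the AGW algorithm, so the following is only a generic sketch of the underlying idea: represent each HTML page as a tag tree, then compare two pages built from the same template to find the text nodes that vary (candidate data fields). It is not the paper's method. The names `Node`, `TreeBuilder`, `parse`, and `varying_fields` are hypothetical, and the sketch assumes both pages share an identical tag structure.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One node of the tag tree; text nodes use the pseudo-tag '#text'."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree using the standard-library HTML parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def varying_fields(a, b, path="#root"):
    """Return paths of text nodes whose content differs between two trees."""
    if a.tag != b.tag or len(a.children) != len(b.children):
        raise ValueError(f"structure differs at {path}")
    if a.tag == "#text":
        return [path] if a.text != b.text else []
    fields = []
    for i, (ca, cb) in enumerate(zip(a.children, b.children)):
        fields += varying_fields(ca, cb, f"{path}/{ca.tag}[{i}]")
    return fields


page1 = parse("<ul><li><b>Title:</b> Dune</li><li><b>Price:</b> 9</li></ul>")
page2 = parse("<ul><li><b>Title:</b> Emma</li><li><b>Price:</b> 12</li></ul>")
for field in varying_fields(page1, page2):
    print(field)
```

In this toy comparison the labels ("Title:", "Price:") are identical across both pages and are treated as template text, while the text nodes that differ are reported as data fields. A real wrapper generator would also have to handle pages whose structures differ, such as repeated or optional elements, which this sketch rejects with an error.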
Pages: 104-108
Page count: 5