Research of Extracting Data from HTML Web Pages Automatically

Cited by: 0

Authors:

王茹

宋瀚涛

陆玉昌

Affiliations:

[1] Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081, China

[2] State Key Laboratory of Intelligent Technology and System, Tsinghua University, Beijing 100084, China

Source:

Journal of Beijing Institute of Technology | 2003 / Issue S1

Keywords:

information extraction; data transformation; wrapper; HTML page

DOI:

10.15918/j.jbit1004-0579.2003.s1.023

Chinese Library Classification (CLC) number:

TP393.092

Discipline classification code:

080402

Abstract:

To make use of data on the Internet, it is necessary to extract data from web pages. An HTT tree model for representing HTML pages is presented. Based on the HTT model, a wrapper generation algorithm, AGW, is proposed. The AGW algorithm uses a comparing-and-correcting technique to generate the wrapper, drawing on the native characteristics of the HTT tree structure. The AGW algorithm not only generates the wrapper automatically but also makes it easy to rebuild the data schema and reduces the computational complexity.

Pages: 104-108

Page count: 5