Structured and semantic data extraction from Web pages

Cited by: 0

Authors:

Gan, Y ^{[1]}

Zhang, SZ ^{[1]}

Affiliations:

[1] Xian Jiaotong Univ, Sch Elect & Informat Engn, Xian 710049, Peoples R China

Source:

PROCEEDINGS OF THE 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7 | 2004

Keywords:

data integration; data extraction; wrapper; Web source;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the development of the Internet, the Web has become an invaluable information source. To use this information for more than human browsing, Web pages in HTML must be converted into a format that software programs can process. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on a user-predefined schema, which automatically generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a given DTD. After the user defines the extraction data schema in the form of a DTD, the wrapper is generated automatically by an induction and learning algorithm. Experiments indicate that the approach can extract the required data from the source document with high accuracy.
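
To make the wrapper idea in the abstract concrete, here is a minimal sketch in Python. It is not the method of the paper: the paper induces the wrapper automatically from a user-defined DTD, whereas this wrapper is written by hand. The schema (`books`, `book`, `title`, `price`), the class `RowWrapper`, and the function `html_to_xml` are all hypothetical names chosen for illustration, and the sketch assumes each record sits in one HTML table row.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical target schema, shown as a DTD for illustration only:
#   <!ELEMENT books (book*)>
#   <!ELEMENT book (title, price)>
FIELDS = ("title", "price")  # hypothetical field names, in column order


class RowWrapper(HTMLParser):
    """Hand-written wrapper: each <tr> with len(FIELDS) <td> cells is one record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._cells = None  # cells of the current row, or None outside a row
        self._text = None   # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        # Collect text only while inside a <td>; nested inline tags are ignored.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Rows without the expected number of cells (e.g. header rows) are skipped.
            if len(self._cells) == len(FIELDS):
                self.records.append(dict(zip(FIELDS, self._cells)))
            self._cells = None


def html_to_xml(html):
    """Extract records from HTML and serialize them in the hypothetical schema."""
    wrapper = RowWrapper()
    wrapper.feed(html)
    wrapper.close()
    root = ET.Element("books")
    for record in wrapper.records:
        book = ET.SubElement(root, "book")
        for name in FIELDS:
            ET.SubElement(book, name).text = record[name]
    return ET.tostring(root, encoding="unicode")
```

Calling `html_to_xml` on a page whose table rows hold a title cell and a price cell yields one `<book>` element per row. Validating the result against the DTD is left out, since the Python standard library does not include a DTD validator.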


Pages: 2930-2935

Page count: 6