Structured and semantic data extraction from Web pages

Cited by: 0
Authors
Gan, Y [1]
Zhang, SZ [1]
Affiliations
[1] Xian Jiaotong Univ, Sch Elect & Informat Engn, Xian 710049, Peoples R China
Source
PROCEEDINGS OF THE 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7 | 2004
Keywords
data integration; data extraction; wrapper; Web source
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the development of the Internet, the Web has become an invaluable information source. To use this information for more than human browsing, Web pages in HTML must be converted into a format that is meaningful to software programs. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on a user-defined schema: it automatically generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a given DTD. After the user defines the extraction schema as a DTD, the wrapper is generated automatically by an induction and learning algorithm. Experiments indicate that the approach extracts the required data from the source document with high accuracy.
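The abstract does not describe the induction algorithm itself, so the following is only a minimal sketch of what a wrapper does once it exists: it reads an HTML page and emits XML shaped like a target DTD. The wrapper here is written by hand rather than generated, and the DTD, the field names (`books`, `book`, `title`, `price`), the sample page, and the helper names `TableRowCollector` and `wrap` are all hypothetical illustrations, not taken from the paper.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical target schema, shown as a DTD for reference only
# (the standard library does not validate against it):
#   <!ELEMENT books (book*)>
#   <!ELEMENT book (title, price)>
#   <!ELEMENT title (#PCDATA)>
#   <!ELEMENT price (#PCDATA)>
FIELDS = ["title", "price"]  # assumed column order of the sample page


class TableRowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (headers, whitespace) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header-only rows contain no <td> cells
                self.rows.append(self._row)
            self._row = None


def wrap(html: str) -> ET.Element:
    """Hand-written wrapper: map each table row to one <book> element."""
    collector = TableRowCollector()
    collector.feed(html)
    collector.close()
    root = ET.Element("books")
    for row in collector.rows:
        if len(row) != len(FIELDS):
            continue  # skip rows that do not fit the assumed schema
        book = ET.SubElement(root, "book")
        for name, value in zip(FIELDS, row):
            ET.SubElement(book, name).text = value
    return root


# Hypothetical source page used only for this illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Sample Book A</td><td>39.95</td></tr>
  <tr><td><b>XML</b> Basics</td><td>25.00</td></tr>
</table>
"""

if __name__ == "__main__":
    print(ET.tostring(wrap(SAMPLE_HTML), encoding="unicode"))
```

In the approach the abstract outlines, the fixed row-to-element mapping hard-coded in `wrap` would instead be induced from the user's DTD and the source page; the sketch only shows the input and output ends of that process.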
Pages: 2930-2935
Number of pages: 6