EGA: An algorithm for automatic semi-structured Web documents extraction

Cited by: 0
Authors
Li, LY [1]
Tang, SW
Yang, DQ
Wang, TJ
Su, ZH
Affiliations
[1] Peking Univ, Natl Lab Machine Percept, Beijing 100871, Peoples R China
[2] Peking Univ, Dept Comp Sci, Beijing 100871, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS | 2004 / Vol. 2973
Keywords
information extraction; genetic algorithm; machine learning; semi-structured document; XPath
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the rapid expansion of the World Wide Web, more and more semi-structured documents are appearing online. In this paper, we study how to extract information from semi-structured web documents using automatically generated wrappers. To automate wrapper generation and the data extraction process, we develop a novel algorithm, EGA (EPattern Generation Algorithm), which constructs extraction patterns based on the local structural context features of web documents. The resulting optimal or near-optimal extraction patterns are expressed in the XPath language. Experimental results on RISE and on our own data sets confirm the feasibility of our approach.
Pages: 787-798
Page count: 12