FIRST-ORDER LOGIC RULE INDUCTION FOR INFORMATION EXTRACTION IN WEB RESOURCES

Cited by: 6
Authors
Fernandez-Villamor, Jose Ignacio [1]
Iglesias, Carlos Angel [1]
Garijo, Mercedes [1]
Affiliations
[1] Univ Politecn Madrid, Dept Ingn Sistemas Telemat, E-28040 Madrid, Spain
Keywords
Information extraction; first-order logic; machine learning; semantic web
DOI
10.1142/S0218213012500327
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Information extraction from web pages, commonly known as screen scraping, is usually performed through wrapper induction, a technique based on the internal structure of HTML documents. The main limitation of such techniques is that a generated wrapper is only useful for the web page it was designed for. To overcome this, this paper proposes a system that generates first-order logic rules that can be used to extract data from web pages. These rules are based on visual features such as font size, element positioning, or content type. Thus, they do not depend on a document's internal structure and can work across different sites. The system has been validated on a set of different web pages, showing very high precision and good recall, which supports the robustness and generalization capabilities of the approach.
Pages: 20