TEX: An efficient and effective unsupervised Web information extractor

Cited by: 36
Authors
Sleiman, Hassan A. [1]
Corchuelo, Rafael [1]
Affiliations
[1] Univ Seville, ETSI Informat, E-41012 Seville, Spain
Keywords
Information extraction; Semi-structured web documents; Malformed documents; Unsupervised technique; Heuristic-based technique; WRAPPER INDUCTION; SYSTEM; PAGES; DOCUMENTS
DOI
10.1016/j.knosys.2012.10.009
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The World Wide Web is an immense information resource. Web information extraction is the task that transforms human-friendly Web information into structured information that can be consumed by automated business processes. In this article, we propose an unsupervised information extractor that works on two or more web documents generated by the same server-side template. It finds and removes shared token sequences amongst these web documents until it finds the relevant information that should be extracted from them. The technique is completely unsupervised and does not require maintenance; it can work on malformed web documents and does not require the relevant information to be formatted using repetitive patterns. Our complexity analysis reveals that our proposal is computationally tractable, and our empirical study on real-world web documents demonstrates that it performs very fast and has very high precision and recall. © 2012 Elsevier B.V. All rights reserved.
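The core idea in the abstract, removing token sequences shared by documents that come from the same template so that only the varying data remains, can be illustrated with a minimal sketch. This is not the TEX algorithm from the paper: the tokenizer, the `min_len` parameter, the function names, and the use of Python's `difflib.SequenceMatcher` to find the longest shared block are all assumptions made for illustration, and the sketch handles only two documents.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split a document into tag tokens and text tokens (a deliberately crude tokenizer)."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def strip_shared(a, b, min_len=1):
    """Recursively remove the longest token block shared by a and b.

    Returns two lists of leftover fragments, one per document. The leftovers
    are the parts that differ, i.e. candidate data rather than template.
    `min_len` (hypothetical parameter) is the shortest block treated as shared.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size < min_len:
        # Nothing shared any more: whatever is left is a data fragment.
        return ([a] if a else []), ([b] if b else [])
    # Drop the shared block and keep searching on both sides of it.
    left_a, left_b = strip_shared(a[:m.a], b[:m.b], min_len)
    right_a, right_b = strip_shared(a[m.a + m.size:], b[m.b + m.size:], min_len)
    return left_a + right_a, left_b + right_b


# Two made-up documents generated by the same (hypothetical) template.
doc1 = "<html><body><h1>Books</h1><p>Title: <b>Dune</b></p><p>Price: <i>9.99</i></p></body></html>"
doc2 = "<html><body><h1>Books</h1><p>Title: <b>Emma</b></p><p>Price: <i>4.50</i></p></body></html>"

data1, data2 = strip_shared(tokenize(doc1), tokenize(doc2))
print(data1)  # [['Dune'], ['9.99']]
print(data2)  # [['Emma'], ['4.50']]
```

Because the sketch compares token sequences rather than parsed trees, it does not need the input to be well-formed HTML, which is in the spirit of the abstract's claim about malformed documents. It makes no attempt to reproduce the paper's complexity bounds or its handling of more than two documents.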
Pages: 109-123
Page count: 15