A class of neural-network-based transducers for web information extraction

Cited by: 13
Authors
Sleiman, Hassan A. [1 ]
Corchuelo, Rafael [1 ]
Affiliations
[1] Univ Seville, ETSI Informat, E-41012 Seville, Spain
Keywords
web wrappers; web information extraction; neural networks; finite automata; machine learning; supervised method; wrapper induction
DOI
10.1016/j.neucom.2013.05.057
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Web is a huge and still growing information repository that has attracted the attention of many companies. Many such companies rely on information extractors to integrate information that is buried into semi-structured web documents into automatic business processes. Many information extractors build on extraction rules, which can be handcrafted or learned using supervised or unsupervised techniques. The literature provides a variety of techniques to learn information extraction rules that build on ad hoc machine learning techniques. In this paper, we propose a hybrid approach that explores the use of standard machine-learning techniques to extract web information. We have specifically explored using neural networks; our results show that our proposal outperforms three state-of-the-art techniques in the literature, which opens up quite a new approach to information extraction. (c) 2013 Elsevier B.V. All rights reserved.
Pages: 61-68
Number of pages: 8