A class of neural-network-based transducers for web information extraction

Cited by: 13
Authors
Sleiman, Hassan A. [1]
Corchuelo, Rafael [1]
Affiliations
[1] Univ Seville, ETSI Informat, E-41012 Seville, Spain
Keywords
web wrappers; web information extraction; neural networks; finite automata; machine learning; supervised method; wrapper induction
DOI
10.1016/j.neucom.2013.05.057
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Web is a huge and still growing information repository that has attracted the attention of many companies. Many such companies rely on information extractors to integrate information that is buried in semi-structured web documents into automatic business processes. Many information extractors build on extraction rules, which can be handcrafted or learned using supervised or unsupervised techniques. The literature provides a variety of techniques to learn information extraction rules that build on ad hoc machine learning techniques. In this paper, we propose a hybrid approach that explores the use of standard machine-learning techniques to extract web information. We have specifically explored using neural networks; our results show that our proposal outperforms three state-of-the-art techniques in the literature, which opens up quite a new approach to information extraction. © 2013 Elsevier B.V. All rights reserved.
Pages: 61-68
Page count: 8