Building web information extraction tasks

Cited by: 1
Authors
Habegger, B [1]
Quafafou, M [1]
Affiliation
[1] Lab Informat Nantes Atlantique, F-44322 Nantes 3, France
Source
IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE (WI 2004), PROCEEDINGS | 2004
DOI
10.1109/WI.2004.10116
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most recent research in the field of information extraction from the Web has concentrated on the task of extracting the underlying content of a set of similarly structured web pages. However, this is not sufficient for building real-world web information extraction applications. Indeed, building such applications requires fully automating access to web sources, which involves more than extracting the data from web pages. The necessary infrastructure must be set up to query a source, retrieve the result pages, extract the results from these pages, and filter out the unwanted results. In this paper we show how such an infrastructure can be set up. We propose to build a web information extraction application by decomposing it into sub-tasks and describing it in an XML-based language named WetDL. Each sub-task consists of applying a web information extraction specific operation to its input, one of these operations being the application of an extractor. By connecting such operations together, complex applications can be defined simply. This is shown in the paper by applying the approach to real-world information extraction tasks such as extracting DVD listings from Amazon.com and extracting addresses from online telephone directories such as superpages.com.
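The decomposition described in the abstract (query a source, retrieve the result pages, extract the results, filter them) can be illustrated with a minimal Python sketch. This is not WetDL and does not reproduce the authors' operators; every name below (FAKE_SOURCE, build_query, fetch, extract, keep_cheap, compose) is hypothetical, and the "web source" is an in-memory dictionary so that the example runs without network access.

```python
import re
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical in-memory stand-in for a web source (assumption, not a real site).
FAKE_SOURCE = {
    "dvd?q=matrix": (
        "<li><b>The Matrix</b> <i>14.99</i></li>"
        "<li><b>The Matrix Reloaded</b> <i>24.99</i></li>"
    ),
}

# Hypothetical extractor pattern: one record per <li> item.
ITEM = re.compile(r"<li><b>(?P<title>.*?)</b> <i>(?P<price>.*?)</i></li>")


def build_query(keyword: str) -> str:
    """Query step: turn a keyword into a request for the source."""
    return f"dvd?q={keyword.lower()}"


def fetch(url: str) -> str:
    """Retrieval step: return the result page, or an empty page if unknown."""
    return FAKE_SOURCE.get(url, "")


def extract(page: str) -> List[Record]:
    """Extraction step: apply the extractor to the page."""
    return [m.groupdict() for m in ITEM.finditer(page)]


def keep_cheap(records: List[Record], max_price: float = 20.0) -> List[Record]:
    """Filter step: drop the unwanted results (here, items above max_price)."""
    return [r for r in records if float(r["price"]) <= max_price]


def compose(*ops: Callable) -> Callable:
    """Connect operations so that each one consumes the previous one's output."""
    def run(value):
        for op in ops:
            value = op(value)
        return value
    return run


# The whole task is the chain of its sub-tasks.
task = compose(build_query, fetch, extract, keep_cheap)
print(task("Matrix"))  # [{'title': 'The Matrix', 'price': '14.99'}]
```

Keeping each step as a separate function means a different extractor or filter can be swapped in without touching the rest of the chain, which is the kind of composition the abstract attributes to connecting operations together.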
Pages: 349-355
Page count: 7