Augmenting Tables by Self-supervised Web Search

Cited by: 0
Authors
Loeser, Alexander [1]
Nagel, Christoph [1]
Pieper, Stephan [1]
Affiliations
[1] Tech Univ Berlin, DIMA Grp, D-10587 Berlin, Germany
Source
ENABLING REAL-TIME BUSINESS INTELLIGENCE | 2011 / Vol. 84
Keywords
information extraction; document collections; query optimization
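DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract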
Users often face the problem of searching the Web for missing values of a spreadsheet. Today only a few US-based search engines have the capacity to aggregate the wealth of information hidden in Web pages that could be used to return these missing values. Exploiting this information with structured queries, such as join queries, is therefore an often requested but still unsolved requirement of many Web users. A major challenge in this scenario is identifying keyword queries for retrieving relevant pages from a Web search engine. We solve this challenge by automatically generating keywords. Our approach is based on the observation that Web page authors have already evolved common words and grammatical structures for describing important relationship types. Each keyword query should return only pages that likely contain a missing relation. Our keyword generator therefore continually monitors grammatical structures or lexical phrases from processed Web pages during query execution. From these, the keyword generator infers significant and non-ambiguous keywords for retrieving pages that likely match the mechanics of a particular relation extractor. We report an experimental study over multiple relation extractors. Our study demonstrates that our generated keywords efficiently return complete result tuples. In contrast to other approaches, we process only very few Web pages.
Pages: 84-99
Page count: 16