Open Information Extraction from the Web

Cited by: 299
Authors
Etzioni, Oren [1]
Banko, Michele
Soderland, Stephen [2]
Weld, Daniel S.
Affiliations
[1] Univ Washington, Turing Ctr, Seattle, WA 98195 USA
[2] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
DOI
10.1145/1409360.1409378
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Open Information Extraction (IE), where the identities of the relations to be extracted are unknown and the billions of documents found on the web necessitate highly scalable processing, is a reliable way of extracting information from the Internet. The first IE systems relied on some form of pattern-matching rules that were manually crafted for each domain. Modern IE automatically learns an extractor from a training set in which domain-specific examples are tagged. Developing suitable training data for IE requires substantial effort and expertise. The KnowItAll web IE system automates IE by learning to label its own training examples using only a small set of domain-independent extraction patterns. TextRunner is a fully implemented Open IE system that uses a two-phase architecture. Its first phase uses a general model of language: it trains a graphical model called a conditional random field (CRF). Open IE also supports aggregating and fusing information across a large number of web pages.
Pages: 68-74
Page count: 7