Open Information Extraction from the Web

Cited by: 284
Authors
Etzioni, Oren [1]
Banko, Michele
Soderland, Stephen [2]
Weld, Daniel S.
Affiliations
[1] Univ Washington, Turing Ctr, Seattle, WA 98195 USA
[2] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
DOI
10.1145/1409360.1409378
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Open Information Extraction (Open IE) is an approach to extracting information from the web in which the identities of the relations to be extracted are not known in advance and the billions of documents on the web necessitate highly scalable processing. The first IE systems relied on pattern-matching rules that were manually crafted for each domain. Modern IE automatically learns an extractor from a training set in which domain-specific examples are tagged, but developing suitable training data requires substantial effort and expertise. The KnowItAll web IE system automates IE by learning to label its own training examples using only a small set of domain-independent extraction patterns. TextRunner is a fully implemented Open IE system that uses a two-phase architecture. Its first phase uses a general model of language and trains a graphical model called a conditional random field (CRF). Open IE also supports aggregating and fusing information across a large number of web pages.
Pages: 68-74
Number of pages: 7