SystemT: A System for Declarative Information Extraction

Cited by: 66
Authors
Krishnamurthy, Rajasekar [1]
Li, Yunyao [1]
Raghavan, Sriram [1]
Reiss, Frederick [1]
Vaithyanathan, Shivakumar [1]
Zhu, Huaiyu [1]
Affiliations
[1] IBM Corp, Almaden Res Ctr, Armonk, NY 10504 USA
Keywords
Data mining;
DOI
10.1145/1519103.1519105
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in the area of information extraction (IE) - the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to: (i) scale to large data sets, and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders of magnitude reduction in the running time of complex extraction tasks.
Pages: 7-13
Page count: 7