Finite-State Machines for Mining Patterns in Very Large Text Repositories

Cited by: 0
Authors
Skut, Wojciech [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Source
FINITE-STATE METHODS AND NATURAL LANGUAGE PROCESSING | 2009 / Vol. 191
Keywords
search engines; text mining; finite-state machines; string matching; complex patterns; OpenFST
DOI
10.3233/978-1-58603-975-2-23
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The emergence of WWW search engines since the 1990s has changed the scale of many natural language processing applications. Text mining, information extraction and related tasks can now be applied to tens of billions of documents, which sets new efficiency standards for NLP algorithms. Finite-state machines are an obvious choice of formal framework for such applications. However, the scale of the task (the size of the searchable corpus and the number of patterns to be matched) often poses a problem even for well-established finite-state string matching techniques. In my presentation, I will focus on the experience gained in implementing a finite-state matching library optimized for searching for large numbers of complex patterns in a WWW-scale repository of documents. Both algorithmic and implementation-related aspects of the task will be discussed. The library is based on OpenFST.
Pages: 23-23
Page count: 1