n-Gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching

被引:0
|
作者
Kim, Min-Soo [1 ,2 ]
Whang, Kyu-Young [1 ,2 ]
Lee, Jae-Gil [1 ,2 ]
机构
[1] Korea Adv Inst Sci & Technol KAIST, Dept Comp Sci, Taejon 305701, South Korea
[2] Korea Adv Inst Sci & Technol KAIST, Adv Informat Technol Res Ctr AITrc, Taejon 305701, South Korea
来源
COMPUTER SYSTEMS SCIENCE AND ENGINEERING | 2007年 / 22卷 / 06期
关键词
approximate string matching; n-gram; inverted index;
D O I
暂无
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Approximate string matching is to find all the occurrences of a query string in a text database allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (simply, n-gram Matching) has been widely used. A major reason is that it is scalable for large databases since it is not a main memory algorithm. Nevertheless, n-gram Matching also has drawbacks: the query performance tends to be bad, and many false positives occur if a large number of errors are allowed. in this paper, we propose an inverted index structure, which we call the n-gram/2L-Approximation index, that improves these drawbacks and an approximate string matching algorithm based on it. The n-gram/2L-Approximation is an adaptation of the n-gram/2L index [4], which the authors have proposed earlier for exact matching. inheriting the advantages of the n-gram/2L index, the n-gram/2L-Approximation index reduces the size of the index and improves the query performance compared with the n-gram inverted index. In addition, the n-gram/2L-Approximation index reduces false positives compared with the n-gram inverted index if a large number of errors are allowed. We perform extensive experiments using the text and protein databases. Experimental results using databases of 1 GBytes show that the n-gram/2L-Approximation index reduces the index size by up to 1.8 times and, at the same time, improves the query performance by up to 4.2 times compared with those of the n-gram inverted index.
引用
收藏
页码:365 / 379
页数:15
相关论文
empty
未找到相关数据