n-Gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching

Cited by: 0
Authors
Kim, Min-Soo [1 ,2 ]
Whang, Kyu-Young [1 ,2 ]
Lee, Jae-Gil [1 ,2 ]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Dept Comp Sci, Taejon 305701, South Korea
[2] Korea Adv Inst Sci & Technol KAIST, Adv Informat Technol Res Ctr AITrc, Taejon 305701, South Korea
Source
COMPUTER SYSTEMS SCIENCE AND ENGINEERING | 2007, Vol. 22, No. 6
Keywords
approximate string matching; n-gram; inverted index;
DOI
Not available
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology];
Discipline Code
0812;
Abstract
Approximate string matching finds all occurrences of a query string in a text database while allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (simply, n-gram Matching) has been widely used, largely because it is not a main-memory algorithm and therefore scales to large databases. Nevertheless, n-gram Matching has drawbacks: query performance tends to be poor, and many false positives occur when a large number of errors is allowed. In this paper, we propose an inverted index structure, which we call the n-gram/2L-Approximation index, that alleviates these drawbacks, together with an approximate string matching algorithm based on it. The n-gram/2L-Approximation index is an adaptation of the n-gram/2L index [4], which the authors have proposed earlier for exact matching. Inheriting the advantages of the n-gram/2L index, the n-gram/2L-Approximation index reduces the index size and improves query performance compared with the n-gram inverted index. In addition, it produces fewer false positives than the n-gram inverted index when a large number of errors is allowed. We perform extensive experiments using text and protein databases. Experimental results using databases of 1 GByte show that the n-gram/2L-Approximation index reduces the index size by up to 1.8 times and, at the same time, improves query performance by up to 4.2 times compared with the n-gram inverted index.
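To make the filtering idea behind n-gram Matching concrete, below is a minimal Python sketch of approximate matching over a plain one-level n-gram inverted index: count filtering to generate candidates, followed by edit-distance verification. The class NGramIndex, the parameter N, and the toy records are illustrative assumptions, not anything defined in the paper; for simplicity it matches whole records rather than locating occurrences inside a long text, and it is not the paper's two-level n-gram/2L-Approximation structure, which reorganizes the index to shrink it and reduce false positives.

```python
# A minimal sketch (illustrative only; NOT the paper's n-gram/2L-Approximation index)
# of approximate string matching with a plain n-gram inverted index.
from collections import defaultdict

N = 3  # n-gram length (an assumption; the paper treats n as a parameter)

def ngrams(s, n=N):
    """Return the overlapping n-grams of s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance (verification step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution / match
        prev = cur
    return prev[-1]

class NGramIndex:
    """Plain one-level n-gram inverted index over a set of strings."""
    def __init__(self, records):
        self.records = records
        self.postings = defaultdict(set)          # n-gram -> set of record ids
        for rid, text in enumerate(records):
            for g in ngrams(text):
                self.postings[g].add(rid)

    def search(self, query, k):
        """Return ids of records within edit distance k of query.

        Count filtering: one edit can destroy at most N of the query's n-grams,
        so a record within k errors must share at least
            (len(query) - N + 1) - k * N
        n-grams with the query. Records below this threshold are pruned;
        the survivors are verified with the exact edit distance.
        """
        threshold = (len(query) - N + 1) - k * N
        counts = defaultdict(int)
        for g in ngrams(query):
            for rid in self.postings.get(g, ()):
                counts[rid] += 1
        if threshold <= 0:
            # Filter gives no pruning power: every record is a candidate.
            # This is the false-positive blow-up for large k mentioned in the abstract.
            candidates = range(len(self.records))
        else:
            candidates = [rid for rid, c in counts.items() if c >= threshold]
        return [rid for rid in candidates
                if edit_distance(query, self.records[rid]) <= k]

# Toy usage: one misspelled query against a small record set.
idx = NGramIndex(["approximate", "appropriate", "matching", "appreciate"])
print(idx.search("aproximate", k=1))   # -> [0]
```

The sketch also shows why the filter degrades as k grows: once k * N exceeds the number of query n-grams, the threshold drops to zero and every record must be verified, which is exactly the regime the n-gram/2L-Approximation index is designed to improve.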
Pages: 365-379
Number of pages: 15
Related Papers
50 items in total (10 listed below)
  • [1] Efficient Inverted Index with n-Gram Sampling for String Matching in Arabic Documents
    Nagoudi, El Moatez Billah
    Khorsi, Ahmed
    Cherroun, Hadda
    2016 IEEE/ACS 13TH INTERNATIONAL CONFERENCE OF COMPUTER SYSTEMS AND APPLICATIONS (AICCSA), 2016,
  • [2] FAST STRING-MATCHING USING AN N-GRAM ALGORITHM
    Kim, J. Y.
    Shawe-Taylor, J.
    SOFTWARE-PRACTICE & EXPERIENCE, 1994, 24 (01): 79 - 88
  • [3] Implementation of PDF Crawler Using Boolean Inverted index and n-gram model
    Kadwe, Snehal S.
    Ardhapurkar, Shrikant
    2017 2ND IEEE INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, SIGNAL PROCESSING AND NETWORKING (WISPNET), 2017, : 2680 - 2683
  • [4] N-gram Index Structure Study for Semantic based Mathematical Formula
    Xu, Yuexia
    Su, Wei
    Cheng, Ming
    Qu, Zhiyi
    Li, Hui
    2014 TENTH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS), 2014, : 293 - 298
  • [5] N-gram measures and L2 writing proficiency
    Garner, James
    Crossley, Scott
    Kyle, Kristopher
    SYSTEM, 2019, 80 : 176 - 187
  • [6] The Growing N-Gram Algorithm: A Novel Approach to String Clustering
    Grappiolo, Corrado
    Verwielen, Eline
    Noorman, Nils
    ICPRAM: PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS, 2019, : 52 - 63
  • [7] Byte Level n-Gram Analysis for Malware Detection
    Jain, Sachin
    Meena, Yogesh Kumar
    COMPUTER NETWORKS AND INTELLIGENT COMPUTING, 2011, 157 : 51 - 59
  • [8] On the N-gram Approximation of Pre-trained Language Models
    Krishnan, Aravind
    Alabi, Jesujoba O.
    Klakow, Dietrich
    INTERSPEECH 2023, 2023, : 371 - 375
  • [9] Language modeling by string pattern N-gram for Japanese speech recognition
    Ito, A
    Kohda, M
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 490 - 493
  • [10] Tuning N-gram String Kernel SVMs via Meta Learning
    Gunasekara, Nuwan
    Pang, Shaoning
    Kasabov, Nikola
    NEURAL INFORMATION PROCESSING: MODELS AND APPLICATIONS, PT II, 2010, 6444 : 91 - 98