n-Gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching

Cited by: 0
Authors
Kim, Min-Soo [1 ,2 ]
Whang, Kyu-Young [1 ,2 ]
Lee, Jae-Gil [1 ,2 ]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Dept Comp Sci, Taejon 305701, South Korea
[2] Korea Adv Inst Sci & Technol KAIST, Adv Informat Technol Res Ctr AITrc, Taejon 305701, South Korea
Source
COMPUTER SYSTEMS SCIENCE AND ENGINEERING | 2007, Vol. 22, No. 6
Keywords
approximate string matching; n-gram; inverted index;
DOI
Not available
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology];
Discipline Code
0812;
Abstract
Approximate string matching finds all occurrences of a query string in a text database while allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (simply, n-gram Matching) has been widely used, largely because it is not a main-memory algorithm and therefore scales to large databases. Nevertheless, n-gram Matching has drawbacks: query performance tends to be poor, and many false positives occur when a large number of errors is allowed. In this paper, we propose an inverted index structure, which we call the n-gram/2L-Approximation index, that alleviates these drawbacks, together with an approximate string matching algorithm based on it. The n-gram/2L-Approximation index is an adaptation of the n-gram/2L index [4], which the authors have proposed earlier for exact matching. Inheriting the advantages of the n-gram/2L index, the n-gram/2L-Approximation index reduces the index size and improves query performance compared with the n-gram inverted index. In addition, it produces fewer false positives than the n-gram inverted index when a large number of errors is allowed. We perform extensive experiments using text and protein databases. Experimental results using databases of 1 GByte show that the n-gram/2L-Approximation index reduces the index size by up to 1.8 times and, at the same time, improves query performance by up to 4.2 times compared with the n-gram inverted index.
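To make the filtering idea behind n-gram Matching concrete, below is a minimal Python sketch of approximate matching over a plain one-level n-gram inverted index: count filtering to generate candidates, followed by edit-distance verification. The class NGramIndex, the parameter N, and the toy records are illustrative assumptions, not anything defined in the paper; for simplicity it matches whole records rather than locating occurrences inside a long text, and it is not the paper's two-level n-gram/2L-Approximation structure, which reorganizes the index to shrink it and reduce false positives.

```python
# A minimal sketch (illustrative only; NOT the paper's n-gram/2L-Approximation index)
# of approximate string matching with a plain n-gram inverted index.
from collections import defaultdict

N = 3  # n-gram length (an assumption; the paper treats n as a parameter)

def ngrams(s, n=N):
    """Return the overlapping n-grams of s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance (verification step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution / match
        prev = cur
    return prev[-1]

class NGramIndex:
    """Plain one-level n-gram inverted index over a set of strings."""
    def __init__(self, records):
        self.records = records
        self.postings = defaultdict(set)          # n-gram -> set of record ids
        for rid, text in enumerate(records):
            for g in ngrams(text):
                self.postings[g].add(rid)

    def search(self, query, k):
        """Return ids of records within edit distance k of query.

        Count filtering: one edit can destroy at most N of the query's n-grams,
        so a record within k errors must share at least
            (len(query) - N + 1) - k * N
        n-grams with the query. Records below this threshold are pruned;
        the survivors are verified with the exact edit distance.
        """
        threshold = (len(query) - N + 1) - k * N
        counts = defaultdict(int)
        for g in ngrams(query):
            for rid in self.postings.get(g, ()):
                counts[rid] += 1
        if threshold <= 0:
            # Filter gives no pruning power: every record is a candidate.
            # This is the false-positive blow-up for large k mentioned in the abstract.
            candidates = range(len(self.records))
        else:
            candidates = [rid for rid, c in counts.items() if c >= threshold]
        return [rid for rid in candidates
                if edit_distance(query, self.records[rid]) <= k]

# Toy usage: one misspelled query against a small record set.
idx = NGramIndex(["approximate", "appropriate", "matching", "appreciate"])
print(idx.search("aproximate", k=1))   # -> [0]
```

The sketch also shows why the filter degrades as k grows: once k * N exceeds the number of query n-grams, the threshold drops to zero and every record must be verified, which is exactly the regime the n-gram/2L-Approximation index is designed to improve.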
Pages: 365-379
Number of pages: 15
Related Papers
50 items in total (10 listed below)
  • [1] Efficient Inverted Index with n-Gram Sampling for String Matching in Arabic Documents
    Nagoudi, El Moatez Billah
    Khorsi, Ahmed
    Cherroun, Hadda
    2016 IEEE/ACS 13TH INTERNATIONAL CONFERENCE OF COMPUTER SYSTEMS AND APPLICATIONS (AICCSA), 2016,
  • [2] FAST STRING-MATCHING USING AN N-GRAM ALGORITHM
    Kim, J. Y.
    Shawe-Taylor, J.
    SOFTWARE-PRACTICE & EXPERIENCE, 1994, 24 (01): 79 - 88
  • [3] Implementation of PDF Crawler Using Boolean Inverted index and n-gram model
    Kadwe, Snehal S.
    Ardhapurkar, Shrikant
    2017 2ND IEEE INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, SIGNAL PROCESSING AND NETWORKING (WISPNET), 2017, : 2680 - 2683
  • [4] N-gram Index Structure Study for Semantic based Mathematical Formula
    Xu, Yuexia
    Su, Wei
    Cheng, Ming
    Qu, Zhiyi
    Li, Hui
    2014 TENTH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS), 2014, : 293 - 298
  • [5] N-gram measures and L2 writing proficiency
    Garner, James
    Crossley, Scott
    Kyle, Kristopher
    SYSTEM, 2019, 80 : 176 - 187
  • [6] The Growing N-Gram Algorithm: A Novel Approach to String Clustering
    Grappiolo, Corrado
    Verwielen, Eline
    Noorman, Nils
    ICPRAM: PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS, 2019, : 52 - 63
  • [7] Byte Level n-Gram Analysis for Malware Detection
    Jain, Sachin
    Meena, Yogesh Kumar
    COMPUTER NETWORKS AND INTELLIGENT COMPUTING, 2011, 157 : 51 - 59
  • [8] On the N-gram Approximation of Pre-trained Language Models
    Krishnan, Aravind
    Alabi, Jesujoba O.
    Klakow, Dietrich
    INTERSPEECH 2023, 2023, : 371 - 375
  • [9] Language modeling by string pattern N-gram for Japanese speech recognition
    Ito, A
    Kohda, M
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 490 - 493
  • [10] Tuning N-gram String Kernel SVMs via Meta Learning
    Gunasekara, Nuwan
    Pang, Shaoning
    Kasabov, Nikola
    NEURAL INFORMATION PROCESSING: MODELS AND APPLICATIONS, PT II, 2010, 6444 : 91 - 98