Efficient Regular Expression Matching Based on Positional Inverted Index

被引:2
作者
Qiu, Tao [1 ]
Yang, Xiaochun [1 ,2 ]
Wang, Bin [1 ]
Wang, Wei [3 ]
机构
[1] Northeastern Univ, Sch Comp Sci & Engn, Shenyang 110819, Liaoning, Peoples R China
[2] Beijing Inst Technol, Beijing 100811, Peoples R China
[3] Univ New South Wales, Sch Comp Sci & Engn, Sydney, NSW 2052, Australia
基金
中国国家自然科学基金;
关键词
Regular expression matching; positional inverted index; query plan; DATABASE;
D O I
10.1109/TKDE.2020.2992295
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We study the efficient regular expression (regex) matching problem. Existing algorithms are the scanning-based algorithms which typically use an equivalent automaton compiled from the regex query to verify a document. Although some works propose various strategies to quickly jump to candidate locations in a document where a query result may appear, they still need to utilize the scanning-based method to verify these candidate locations. These methods become inefficient when there are still many candidate locations needed to be verified. In this article, we propose a novel approach to efficiently compute all matching positions for a regex query purely based on a positional q-gram inverted index. We propose a gram-driven NFA to represent the language of a regex and show all regex matching locations can be obtained by finding positions on q-grams of GNFA that satisfy certain positional constraints. Then we propose several GNFA-based query plans to answer the query using the positional inverted index. In order to improve the query efficiency, we design the algorithm to build a tree-based query plan by carefully choosing a checking order for positional constraints. Experimental results on real-world datasets show that our method outperforms state-of-the-art methods by up to an order of magnitude in query efficiency.
引用
收藏
页码:1133 / 1148
页数:16
相关论文
共 38 条
[11]   Fast Set Intersection in Memory [J].
Ding, Bolin ;
Koenig, Arnd Christian .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2011, 4 (04) :255-266
[12]   The PROSITE database, its status in 2002 [J].
Falquet, L ;
Pagni, M ;
Bucher, P ;
Hulo, N ;
Sigrist, CJA ;
Hofmann, K ;
Bairoch, A .
NUCLEIC ACIDS RESEARCH, 2002, 30 (01) :235-238
[13]  
Fang Yu, 2006, ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS 2006), P93, DOI 10.1109/ANCS.2006.4579527
[14]  
Gormley C, 2015, Elasticsearch: The Definitive Guide
[15]   SigMatch: Fast and Scalable Multi-Pattern Matching [J].
Kandhan, Ramakrishnan ;
Teletia, Nikhil ;
Patel, Jignesh M. .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (01) :1173-1184
[16]  
Kim J, 2016, PROC INT CONF DATA, P169, DOI 10.1109/ICDE.2016.7498238
[17]   List Intersection for Web Search: Algorithms, Cost Models, and Optimizations [J].
Kim, Sunghwan ;
Lee, Taesung ;
Hwang, Seung-won ;
Elnikety, Sameh .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 12 (01) :1-13
[18]   Efficient processing of substring match queries with inverted variable-length gram indexes [J].
Kim, Younghoon ;
Park, Hyoungmin ;
Shim, Kyuseok ;
Woo, Kyoung-Gu .
INFORMATION SCIENCES, 2013, 244 :119-141
[19]   Efficient Processing of Substring Match Queries with Inverted q-gram Indexes [J].
Kim, Younghoon ;
Woo, Kyoung-Gu ;
Park, Hyoungmin ;
Shim, Kyuseok .
26TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING ICDE 2010, 2010, :721-732
[20]   Efficient online index construction for text databases [J].
Lester, Nicholas ;
Moffat, Alistair ;
Zobel, Justin .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 2008, 33 (03)