Efficient Regular Expression Matching Based on Positional Inverted Index

被引：2

作者：

Qiu, Tao ^{[1
]}

Yang, Xiaochun ^{[1
,2
]}

Wang, Bin ^{[1
]}

Wang, Wei ^{[3
]}

机构：

[1] Northeastern Univ, Sch Comp Sci & Engn, Shenyang 110819, Liaoning, Peoples R China

[2] Beijing Inst Technol, Beijing 100811, Peoples R China

[3] Univ New South Wales, Sch Comp Sci & Engn, Sydney, NSW 2052, Australia

来源：

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2022年 / 34卷 / 03期

基金：

中国国家自然科学基金;

关键词：

Regular expression matching; positional inverted index; query plan; DATABASE;

D O I：

10.1109/TKDE.2020.2992295

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

We study the efficient regular expression (regex) matching problem. Existing algorithms are the scanning-based algorithms which typically use an equivalent automaton compiled from the regex query to verify a document. Although some works propose various strategies to quickly jump to candidate locations in a document where a query result may appear, they still need to utilize the scanning-based method to verify these candidate locations. These methods become inefficient when there are still many candidate locations needed to be verified. In this article, we propose a novel approach to efficiently compute all matching positions for a regex query purely based on a positional q-gram inverted index. We propose a gram-driven NFA to represent the language of a regex and show all regex matching locations can be obtained by finding positions on q-grams of GNFA that satisfy certain positional constraints. Then we propose several GNFA-based query plans to answer the query using the positional inverted index. In order to improve the query efficiency, we design the algorithm to build a tree-based query plan by carefully choosing a checking order for positional constraints. Experimental results on real-world datasets show that our method outperforms state-of-the-art methods by up to an order of magnitude in query efficiency.

引用

页码：1133 / 1148

页数：16

共 38 条

[11] Fast Set Intersection in Memory [J].

Ding, Bolin ;

Koenig, Arnd Christian .

PROCEEDINGS OF THE VLDB ENDOWMENT, 2011, 4 (04) :255-266

[12] The PROSITE database, its status in 2002 [J].

Falquet, L ;

Pagni, M ;

Bucher, P ;

Hulo, N ;

Sigrist, CJA ;

Hofmann, K ;

Bairoch, A .

NUCLEIC ACIDS RESEARCH, 2002, 30 (01) :235-238

[13]

Fang Yu, 2006, ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS 2006), P93, DOI 10.1109/ANCS.2006.4579527

[14]

Gormley C, 2015, Elasticsearch: The Definitive Guide

[15] SigMatch: Fast and Scalable Multi-Pattern Matching [J].

Kandhan, Ramakrishnan ;

Teletia, Nikhil ;

Patel, Jignesh M. .

PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (01) :1173-1184

[16]

Kim J, 2016, PROC INT CONF DATA, P169, DOI 10.1109/ICDE.2016.7498238

[17] List Intersection for Web Search: Algorithms, Cost Models, and Optimizations [J].

Kim, Sunghwan ;

Lee, Taesung ;

Hwang, Seung-won ;

Elnikety, Sameh .

PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 12 (01) :1-13

[18] Efficient processing of substring match queries with inverted variable-length gram indexes [J].

Kim, Younghoon ;

Park, Hyoungmin ;

Shim, Kyuseok ;

Woo, Kyoung-Gu .

INFORMATION SCIENCES, 2013, 244 :119-141

[19] Efficient Processing of Substring Match Queries with Inverted q-gram Indexes [J].

Kim, Younghoon ;

Woo, Kyoung-Gu ;

Park, Hyoungmin ;

Shim, Kyuseok .

26TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING ICDE 2010, 2010, :721-732

[20] Efficient online index construction for text databases [J].

Lester, Nicholas ;

Moffat, Alistair ;

Zobel, Justin .

ACM TRANSACTIONS ON DATABASE SYSTEMS, 2008, 33 (03)

← 1 2 3 4 →