Efficient Regular Expression Matching Based on Positional Inverted Index

被引：2

作者：

Qiu, Tao ^{[1
]}

Yang, Xiaochun ^{[1
,2
]}

Wang, Bin ^{[1
]}

Wang, Wei ^{[3
]}

机构：

[1] Northeastern Univ, Sch Comp Sci & Engn, Shenyang 110819, Liaoning, Peoples R China

[2] Beijing Inst Technol, Beijing 100811, Peoples R China

[3] Univ New South Wales, Sch Comp Sci & Engn, Sydney, NSW 2052, Australia

来源：

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2022年 / 34卷 / 03期

基金：

中国国家自然科学基金;

关键词：

Regular expression matching; positional inverted index; query plan; DATABASE;

D O I：

10.1109/TKDE.2020.2992295

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

We study the efficient regular expression (regex) matching problem. Existing algorithms are the scanning-based algorithms which typically use an equivalent automaton compiled from the regex query to verify a document. Although some works propose various strategies to quickly jump to candidate locations in a document where a query result may appear, they still need to utilize the scanning-based method to verify these candidate locations. These methods become inefficient when there are still many candidate locations needed to be verified. In this article, we propose a novel approach to efficiently compute all matching positions for a regex query purely based on a positional q-gram inverted index. We propose a gram-driven NFA to represent the language of a regex and show all regex matching locations can be obtained by finding positions on q-grams of GNFA that satisfy certain positional constraints. Then we propose several GNFA-based query plans to answer the query using the positional inverted index. In order to improve the query efficiency, we design the algorithm to build a tree-based query plan by carefully choosing a checking order for positional constraints. Experimental results on real-world datasets show that our method outperforms state-of-the-art methods by up to an order of magnitude in query efficiency.

引用

页码：1133 / 1148

页数：16

共 38 条

[1]

[Anonymous], 2008, Introduction to Information Retrieval. Plus 0.5em Minus 0.4em

[2]

BaezaYates R, 2004, LECT NOTES COMPUT SC, V3109, P400

[3] The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 [J].

Bairoch, A ;

Apweiler, R .

NUCLEIC ACIDS RESEARCH, 2000, 28 (01) :45-48

[4] Scalable Lookahead Regular Expression Detection System for Deep Packet Inspection [J].

Bando, Masanori ;

Artan, N. Sertac ;

Chao, H. Jonathan .

IEEE-ACM TRANSACTIONS ON NETWORKING, 2012, 20 (03) :699-714

[5]

Barbay J., 2009, JEA, V14, P7

[6] FROM REGULAR EXPRESSIONS TO DETERMINISTIC AUTOMATA [J].

BERRY, G ;

SETHI, R .

THEORETICAL COMPUTER SCIENCE, 1986, 48 (01) :117-126

[7]

Chang A X., 2014, 201402 CSTR STANF U

[8] A fast regular expression indexing engine [J].

Cho, JH ;

Rajagopalan, S .

18TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2002, :419-430

[9] Efficient Set Intersection for Inverted Indexing [J].

Culpepper, J. Shane ;

Moffat, Alistair .

ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2010, 29 (01)

[10]

DeRose Pedro., 2007, CIDR, P169

← 1 2 3 4 →