Practical High-Order Entropy-Compressed Text Self-Indexing

被引：3

作者：

Huo, Hongwei ^{[1
]}

Long, Peng ^{[2
]}

Vitter, Jeffrey Scott ^{[3
,4
]}

机构：

[1] Xidian Univ, Comp Sci & Technol, Xian 710071, Shaanxi, Peoples R China

[2] Xidian Univ, Xian 710071, Shaanxi, Peoples R China

[3] Univ Mississippi, Oxford, MS 38677 USA

[4] Tulane Univ, New Orleans, LA 70118 USA

来源：

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2023年 / 35卷 / 03期

基金：

中国国家自然科学基金;

关键词：

Entropy; Arrays; Indexing; Bioinformatics; Pattern matching; Genomics; Encoding; Text indexing; text retrieval; entropy-compressed; compressed data structures; pattern matching; SUFFIX ARRAYS; CONSTRUCTION; TREE;

D O I：

10.1109/TKDE.2021.3114401

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Compressed self-indexes are used widely in string processing applications, such as information retrieval, genome analysis, data mining, and web searching. The index not only indexes the data, but also encodes the data, and it is in compressed form. Moreover, the index and the data it encodes can be operated upon directly, without need to uncompress the entire index, thus saving time while maintaining small storage space. In some applications, such as in genome analysis, existing methods do not exploit the full possibilities of compressed self-indexes, and thus we seek faster and more space-efficient indexes. In this paper, we propose a practical high-order entropy-compressed self-index for efficient pattern matching in a text. We give practical implementations of compressed suffix arrays using a hybrid encoding in the representation of the neighbor function F. We analyze the performance in theory and practice of our recommended indexing method, called GeCSA. We can improve retrieval time further using an iterated version of the neighbor function. Experimental results on the tested data demonstrate that the proposed index GeCSA has good overall advantages in space usage and retrieval time over the state-of-the-art indexing methods, especially on the repetitive data.

引用

页码：2943 / 2960

页数：18

共 50 条

[1] High-order entropy-compressed text indexes
Grossi, R
Gupta, A
Vitter, JS
PROCEEDINGS OF THE FOURTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2003, : 841 - 850
[2] On Entropy-Compressed Text Indexing in External Memory
Hon, Wing-Kai
Shah, Rahul
Thankachan, Sharma V.
Vitter, Jeffrey Scott
STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2009, 5721 : 75 - +
[3] A Practical Implementation of Compressed Suffix Arrays with Applications to Self-Indexing
Huo, Hongwei
Chen, Longgang
Vitter, Jeffrey Scott
Nekrich, Yakov
2014 DATA COMPRESSION CONFERENCE (DCC 2014), 2014, : 292 - 301
[4] Practical Dynamic Entropy-Compressed Bitvectors with Applications
Cordova, Joshimar
Navarro, Gonzalo
EXPERIMENTAL ALGORITHMS, SEA 2016, 2016, 9685 : 105 - 117
[5] Practical Entropy-Compressed Rank/Select Dictionary
Okanohara, Daisuke
Sadakane, Kunihiko
PROCEEDINGS OF THE NINTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS AND THE FOURTH WORKSHOP ON ANALYTIC ALGORITHMICS AND COMBINATORICS, 2007, : 60 - +
[6] Dynamic Entropy-Compressed Sequences and Full-Text Indexes
Maekinen, Veli
Navarro, Gonzalo
ACM TRANSACTIONS ON ALGORITHMS, 2008, 4 (03)
[7] Dynamic entropy-compressed sequences and full-text indexes
Makinen, Veli
Navarro, Gonzalo
COMBINATORIAL PATTERN MATCHING, PROCEEDINGS, 2006, 4009 : 306 - 317
[8] Self-indexing inverted files for fast text retrieval
Moffat, A
Zobel, J
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1996, 14 (04) : 349 - 379
[9] Combining Text Compression and String Matching: The Miracle of Self-Indexing
Navarro, Conzalo
PROCEEDINGS OF THE PRAGUE STRINGOLOGY CONFERENCE 2009, 2009, : 1 - 1
[10] Improved self-indexing inverted files for full-text retrieval
College of Compute Science, South-Central University for Nationalities, Wuhan 430074, China
不详
J. Comput. Inf. Syst., 2009, 2 (1017-1024):

← 1 2 3 4 5 →