Practical High-Order Entropy-Compressed Text Self-Indexing

被引:3
|
作者
Huo, Hongwei [1 ]
Long, Peng [2 ]
Vitter, Jeffrey Scott [3 ,4 ]
机构
[1] Xidian Univ, Comp Sci & Technol, Xian 710071, Shaanxi, Peoples R China
[2] Xidian Univ, Xian 710071, Shaanxi, Peoples R China
[3] Univ Mississippi, Oxford, MS 38677 USA
[4] Tulane Univ, New Orleans, LA 70118 USA
基金
中国国家自然科学基金;
关键词
Entropy; Arrays; Indexing; Bioinformatics; Pattern matching; Genomics; Encoding; Text indexing; text retrieval; entropy-compressed; compressed data structures; pattern matching; SUFFIX ARRAYS; CONSTRUCTION; TREE;
D O I
10.1109/TKDE.2021.3114401
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Compressed self-indexes are used widely in string processing applications, such as information retrieval, genome analysis, data mining, and web searching. The index not only indexes the data, but also encodes the data, and it is in compressed form. Moreover, the index and the data it encodes can be operated upon directly, without need to uncompress the entire index, thus saving time while maintaining small storage space. In some applications, such as in genome analysis, existing methods do not exploit the full possibilities of compressed self-indexes, and thus we seek faster and more space-efficient indexes. In this paper, we propose a practical high-order entropy-compressed self-index for efficient pattern matching in a text. We give practical implementations of compressed suffix arrays using a hybrid encoding in the representation of the neighbor function F. We analyze the performance in theory and practice of our recommended indexing method, called GeCSA. We can improve retrieval time further using an iterated version of the neighbor function. Experimental results on the tested data demonstrate that the proposed index GeCSA has good overall advantages in space usage and retrieval time over the state-of-the-art indexing methods, especially on the repetitive data.
引用
收藏
页码:2943 / 2960
页数:18
相关论文
共 50 条
  • [1] High-order entropy-compressed text indexes
    Grossi, R
    Gupta, A
    Vitter, JS
    PROCEEDINGS OF THE FOURTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2003, : 841 - 850
  • [2] On Entropy-Compressed Text Indexing in External Memory
    Hon, Wing-Kai
    Shah, Rahul
    Thankachan, Sharma V.
    Vitter, Jeffrey Scott
    STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2009, 5721 : 75 - +
  • [3] A Practical Implementation of Compressed Suffix Arrays with Applications to Self-Indexing
    Huo, Hongwei
    Chen, Longgang
    Vitter, Jeffrey Scott
    Nekrich, Yakov
    2014 DATA COMPRESSION CONFERENCE (DCC 2014), 2014, : 292 - 301
  • [4] Practical Dynamic Entropy-Compressed Bitvectors with Applications
    Cordova, Joshimar
    Navarro, Gonzalo
    EXPERIMENTAL ALGORITHMS, SEA 2016, 2016, 9685 : 105 - 117
  • [5] Practical Entropy-Compressed Rank/Select Dictionary
    Okanohara, Daisuke
    Sadakane, Kunihiko
    PROCEEDINGS OF THE NINTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS AND THE FOURTH WORKSHOP ON ANALYTIC ALGORITHMICS AND COMBINATORICS, 2007, : 60 - +
  • [6] Dynamic Entropy-Compressed Sequences and Full-Text Indexes
    Maekinen, Veli
    Navarro, Gonzalo
    ACM TRANSACTIONS ON ALGORITHMS, 2008, 4 (03)
  • [7] Dynamic entropy-compressed sequences and full-text indexes
    Makinen, Veli
    Navarro, Gonzalo
    COMBINATORIAL PATTERN MATCHING, PROCEEDINGS, 2006, 4009 : 306 - 317
  • [8] Self-indexing inverted files for fast text retrieval
    Moffat, A
    Zobel, J
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1996, 14 (04) : 349 - 379
  • [9] Combining Text Compression and String Matching: The Miracle of Self-Indexing
    Navarro, Conzalo
    PROCEEDINGS OF THE PRAGUE STRINGOLOGY CONFERENCE 2009, 2009, : 1 - 1
  • [10] Improved self-indexing inverted files for full-text retrieval
    College of Compute Science, South-Central University for Nationalities, Wuhan 430074, China
    不详
    J. Comput. Inf. Syst., 2009, 2 (1017-1024):