Suffix-Sorting via Shannon-Fano-Elias Codes

被引:10
作者
Adjeroh, Donald [1 ]
Nan, Fei [1 ]
机构
[1] West Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
关键词
suffix sorting; suffix arrays; suffix tree; Shannon-Fano-Elias codes; source coding;
D O I
10.3390/a3020145
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Given a sequence T = t(0)t(1) ... t(n-1) of size n = |T|, with symbols from a fixed alphabet Sigma, (|S| <= n), the suffix array provides a listing of all the suffixes of T in a lexicographic order. Given T, the suffix sorting problem is to construct its suffix array. The direct suffix sorting problem is to construct the suffix array of T directly without using the suffix tree data structure. While algorithims for linear time, linear space direct suffix sorting have been proposed, the actual constant in the linear space is still a major concern, given that the applications of suffix trees and suffix arrays (such as in whole-genome analysis) often involve huge data sets. In this work, we reduce the gap between current results and the minimal space requirement. We introduce an algorithm for the direct suffix sorting problem with worst case time complexity in O(n), requiring only (1(sic)n log n - n log |Sigma| + O(1)) bits in memory space. This implies 5(sic)n | O(1) bytes for total space requirment, (including space for both the output suffix array and the input sequence T) assuming n <= 2(32), |Sigma| <= 256, and 4 bytes per integer. The basis of our algorithm is an extension of Shannon-Fano-Elias codes used in source coding and information theory. This is the first time information-theoretic methods have been used as the basis for solving the suffix sorting problem.
引用
收藏
页码:145 / 167
页数:23
相关论文
共 35 条
[1]  
Abouelhoda M. I., 2004, Journal of Discrete Algorithms, V2, P53, DOI 10.1016/S1570-8667(03)00065-0
[2]  
[Anonymous], 2008, BURROWS WHEELER TRAN, DOI [DOI 10.1007/978-0-387-78909-5, 10.1007/978-0-387-78909-5]
[3]  
Bell T. C., 1990, TEXT COMPRESSION
[4]  
Burkhardt S., 2003, P 14 ANN ACM SIAM S
[5]  
Burrows M, 1994, BLOCK SORTING LOSSLE
[6]   Unbounded length contexts for PPM [J].
Cleary, JG ;
Teahan, WJ .
COMPUTER JOURNAL, 1997, 40 (2-3) :67-75
[7]  
Cover T. M., 2006, ELEMENTS INFORM THEO, DOI [DOI 10.1002/047174882X, DOI 10.1002/047174882X.CH5]
[8]   On the sorting-complexity of suffix tree construction [J].
Farach-Colton, M ;
Ferragina, P ;
Muthukrishnan, S .
JOURNAL OF THE ACM, 2000, 47 (06) :987-1011
[9]   Opportunistic data structures with applications [J].
Ferragina, P ;
Manzini, G .
41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, :390-398
[10]   ORDER-PRESERVING MINIMAL PERFECT HASH FUNCTIONS AND INFORMATION-RETRIEVAL [J].
FOX, EA ;
QI, FC ;
DAOUD, AM ;
HEATH, LS .
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1991, 9 (03) :281-308