DNA sequence compression using the Burrows-Wheeler Transform

被引:28
作者
Adjeroh, D [1 ]
Zhang, Y [1 ]
Mukherjee, A [1 ]
Powell, M [1 ]
Bell, T [1 ]
机构
[1] W Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
来源
CSB2002: IEEE COMPUTER SOCIETY BIOINFORMATICS CONFERENCE | 2002年
关键词
DNA sequence compression; repetition structures; Burrows-Wheeler Transform; BWT;
D O I
10.1109/CSB.2002.1039352
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
We investigate. off-line dictionary oriented approaches to DNA sequence compression, based on the Burrows-Wheeler Transform (BWT). The preponderance of short repeating patterns is an important phenomenon in biological sequences. Here, we propose off-line methods to compress DNA sequences that exploit the different repetition structures inherent in such sequences. Repetition analysis is performed based on the relationship between the BWT and important pattern matching data structures, such as the suffix tree and suffix array. We discuss how the proposed approach can be incorporated in the BWT compression pipeline.
引用
收藏
页码:303 / 313
页数:11
相关论文
共 36 条
[1]  
ADJEROH DA, 2002, P IEEE DAT COMPR C
[2]   Off-line compression by greedy textual substitution [J].
Apostolico, A ;
Lonardi, S .
PROCEEDINGS OF THE IEEE, 2000, 88 (11) :1733-1744
[3]   ROBUST TRANSMISSION OF UNBOUNDED STRINGS USING FIBONACCI REPRESENTATIONS [J].
APOSTOLICO, A ;
FRAENKEL, AS .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1987, 33 (02) :238-245
[4]  
APOSTOLICO A, 2001, P IEEE DAT COMPR C
[5]   Computer simulation of expansions of DNA triplet repeats in the fragile X syndrome and Huntington's disease [J].
Bat, O ;
Kimmel, M ;
Axelrod, DE .
JOURNAL OF THEORETICAL BIOLOGY, 1997, 188 (01) :53-67
[6]  
Bell T. C., 1990, TEXT COMPRESSION
[7]  
BELL TC, 2002, P IEEE DAT COMPR C
[8]   A LOCALLY ADAPTIVE DATA-COMPRESSION SCHEME [J].
BENTLEY, JL ;
SLEATOR, DD ;
TARJAN, RE ;
WEI, VK .
COMMUNICATIONS OF THE ACM, 1986, 29 (04) :320-330
[9]  
Burrows M., 1994, 124 DIG EQ CORP
[10]  
CHEN X, 1999, P 10 WORKSH GEN INF, P52