A new algorithm for "the LCS problem" with application in compressing genome resequencing data

被引:12
作者
Beal, Richard [1 ]
Afrin, Tazin [1 ]
Farheen, Aliya [1 ]
Adjeroh, Donald [1 ]
机构
[1] W Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
来源
BMC GENOMICS | 2016年 / 17卷
基金
美国国家科学基金会;
关键词
Longest common subsequence; LCS; Longest previous factor; LPF; Compression; Biology; Genome resequencing; TEXTUAL DATA-COMPRESSION; COMPUTATIONAL BIOLOGY; LONGEST; PROTEIN;
D O I
10.1186/s12864-016-2793-0
中图分类号
Q81 [生物工程学(生物技术)]; Q93 [微生物学];
学科分类号
071005 ; 0836 ; 090102 ; 100705 ;
摘要
Background: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. Methods: First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself. Results: Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011). Conclusion: We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.
引用
收藏
页数:13
相关论文
共 33 条
  • [1] Aach John., 2001, Nature, V26, P5
  • [2] Adjeroh D., 2008, The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching
  • [3] Adjeroh D, 2006, IEEE DATA COMPR CONF, P422
  • [4] THE BOYER-MOORE-GALIL STRING SEARCHING STRATEGIES REVISITED
    APOSTOLICO, A
    GIANCARLO, R
    [J]. SIAM JOURNAL ON COMPUTING, 1986, 15 (01) : 98 - 105
  • [5] Beal R, 2015, IEEE INT C BIOINFORM, P69, DOI 10.1109/BIBM.2015.7359657
  • [6] Variations of the parameterized longest previous factor
    Beal, Richard
    Adjeroh, Donald
    [J]. JOURNAL OF DISCRETE ALGORITHMS, 2012, 16 : 129 - 150
  • [7] Parameterized longest previous factor
    Beal, Richard
    Adjeroh, Donald
    [J]. THEORETICAL COMPUTER SCIENCE, 2012, 437 : 21 - 34
  • [8] CORMEN TH, 2001, INTRO ALGORITHMS
  • [9] Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
    Cox, Anthony J.
    Bauer, Markus J.
    Jakobi, Tobias
    Rosone, Giovanna
    [J]. BIOINFORMATICS, 2012, 28 (11) : 1415 - 1419
  • [10] A simple algorithm for computing the Lempel-Ziv factorization
    Crochemore, Maxime
    Ilie, Lucian
    Smyth, W. F.
    [J]. DCC: 2008 DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2008, : 482 - +