Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment

Cited by: 5
Authors
Nagar, Anurag [1]
Hahsler, Michael [2]
Affiliations
[1] So Methodist Univ, Dept Comp Sci & Engn, Dallas, TX 75205 USA
[2] So Methodist Univ, Dept Engn Management Informat & Syst, Dallas, TX USA
Source
BMC BIOINFORMATICS | 2013 / Vol. 14
Keywords
MULTIPLE; IDENTIFICATION; COMPLEXITY; MUSCLE;
DOI
10.1186/1471-2105-14-S11-S2
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Next Generation Sequencing techniques are producing enormous amounts of biological sequence data, and analysis is becoming a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and which requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so-called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences.
Results: In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly.
Conclusion: Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions quickly or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA.
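Illustration (not part of the published record): the abstract's core idea, comparing sequence segments by p-mer frequency counts instead of aligning them, can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the segment length, the value of p, or the exact distance measure, so the choices here (p = 3, segments of 100 bases, Manhattan distance between count vectors) and all function names are hypothetical. The data stream clustering step is not shown.

```python
from collections import Counter


def pmer_counts(segment: str, p: int = 3) -> Counter:
    """Count the overlapping p-mers (length-p substrings) in a segment."""
    return Counter(segment[i:i + p] for i in range(len(segment) - p + 1))


def pmer_distance(a: str, b: str, p: int = 3) -> int:
    """Manhattan distance between the p-mer count vectors of two segments.

    Assumption: used here as a cheap stand-in for edit distance; the
    abstract does not name the measure the paper actually uses.
    """
    ca, cb = pmer_counts(a, p), pmer_counts(b, p)
    # Counter returns 0 for p-mers that are absent from one segment.
    return sum(abs(ca[k] - cb[k]) for k in ca.keys() | cb.keys())


def segments(sequence: str, length: int = 100) -> list[str]:
    """Split a sequence into consecutive, non-overlapping segments."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]
```

Identical segments get distance 0, and a single substitution changes at most 2p counts, so under these assumptions small distances indicate similar segments. Each segment is processed in one pass, which is consistent with the linear run time the abstract reports.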
Pages: 12