Hobbes: optimized gram-based methods for efficient read alignment

被引:44
作者
Ahmadi, Athena [1 ]
Behm, Alexander [1 ]
Honnalli, Nagesh [1 ]
Li, Chen [1 ]
Weng, Lingjie [1 ]
Xie, Xiaohui [1 ,2 ]
机构
[1] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92697 USA
[2] Univ Calif Irvine, Inst Genom & Bioinformat, Irvine, CA 92697 USA
基金
美国国家科学基金会;
关键词
SEQUENCES; ULTRAFAST; TOOL;
D O I
10.1093/nar/gkr1246
中图分类号
Q5 [生物化学]; Q7 [分子生物学];
学科分类号
071010 ; 081704 ;
摘要
Recent advances in sequencing technology have enabled the rapid generation of billions of bases at relatively low cost. A crucial first step in many sequencing applications is to map those reads to a reference genome. However, when the reference genome is large, finding accurate mappings poses a significant computational challenge due to the sheer amount of reads, and because many reads map to the reference sequence approximately but not exactly. We introduce Hobbes, a new gram-based program for aligning short reads, supporting Hamming and edit distance. Hobbes implements two novel techniques, which yield substantial performance improvements: an optimized gram-selection procedure for reads, and a cache-efficient filter for pruning candidate mappings. We systematically tested the performance of Hobbes on both real and simulated data with read lengths varying from 35 to 100 bp, and compared its performance with several state-of-the-art read-mapping programs, including Bowtie, BWA, mrsFast and RazerS. Hobbes is faster than all other read mapping programs we have tested while maintaining high mapping quality. Hobbes is about five times faster than Bowtie and about 2-10 times faster than BWA, depending on read length and error rate, when asked to find all mapping locations of a read in the human genome within a given Hamming or edit distance, respectively. Hobbes supports the SAM output format and is publicly available at http://hobbes.ics.uci.edu.
引用
收藏
页数:11
相关论文
共 29 条
  • [1] Personalized copy number and segmental duplication maps using next-generation sequencing
    Alkan, Can
    Kidd, Jeffrey M.
    Marques-Bonet, Tomas
    Aksay, Gozde
    Antonacci, Francesca
    Hormozdiari, Fereydoun
    Kitzman, Jacob O.
    Baker, Carl
    Malig, Maika
    Mutlu, Onur
    Sahinalp, S. Cenk
    Gibbs, Richard A.
    Eichler, Evan E.
    [J]. NATURE GENETICS, 2009, 41 (10) : 1061 - U29
  • [2] BASIC LOCAL ALIGNMENT SEARCH TOOL
    ALTSCHUL, SF
    GISH, W
    MILLER, W
    MYERS, EW
    LIPMAN, DJ
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) : 403 - 410
  • [3] Bauer M.J., 2010, P 18 ANN C INT SYST
  • [4] Burkhardt S, 2003, FUND INFORM, V56, P51
  • [5] Burrows M, 1994, BLOCK SORTING LOSSLE
  • [6] Chaudhuri S., 2006, ICDE, DOI [10.1109/ICDE.2006.9, DOI 10.1109/ICDE.2006.9]
  • [7] Discovering Transcription Factor Binding Sites in Highly Repetitive Regions of Genomes with Multi-Read Analysis of ChIP-Seq Data
    Chung, Dongjun
    Kuan, Pei Fen
    Li, Bo
    Sanalkumar, Rajendran
    Liang, Kun
    Bresnick, Emery H.
    Dewey, Colin
    Keles, Suenduez
    [J]. PLOS COMPUTATIONAL BIOLOGY, 2011, 7 (07)
  • [8] RATES OF TRANSITION AND TRANSVERSION IN CODING SEQUENCES SINCE THE HUMAN-RODENT DIVERGENCE
    COLLINS, DW
    JUKES, TH
    [J]. GENOMICS, 1994, 20 (03) : 386 - 396
  • [9] SeqAn An efficient, generic C++ library for sequence analysis
    Doering, Andreas
    Weese, David
    Rausch, Tobias
    Reinert, Knut
    [J]. BMC BIOINFORMATICS, 2008, 9 (1)
  • [10] Ferragina P, 2001, SIAM PROC S, P269