These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

被引:46
作者
Zhang, Qingpeng [1 ]
Pell, Jason [1 ]
Canino-Koning, Rosangela [1 ]
Howe, Adina Chuang [2 ,3 ]
Brown, C. Titus [1 ,2 ]
机构
[1] Michigan State Univ, Dept Comp Sci & Engn, E Lansing, MI 48824 USA
[2] Michigan State Univ, Dept Microbiol & Mol Genet, E Lansing, MI 48824 USA
[3] Michigan State Univ, Dept Plant Soil & Microbial Sci, E Lansing, MI 48824 USA
来源
PLOS ONE | 2014年 / 9卷 / 07期
基金
美国农业部; 美国国家科学基金会; 美国国家卫生研究院;
关键词
GENERATION; SIZE;
D O I
10.1371/journal.pone.0101271
中图分类号
O [数理科学和化学]; P [天文学、地球科学]; Q [生物科学]; N [自然科学总论];
学科分类号
07 ; 0710 ; 09 ;
摘要
K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
引用
收藏
页数:13
相关论文
共 39 条
  • [1] [Anonymous], 2012, ARXIV12034802
  • [2] KAnalyze: a fast versatile pipelined K-mer toolkit
    Audano, Peter
    Vannberg, Fredrik
    [J]. BIOINFORMATICS, 2014, 30 (14) : 2070 - 2072
  • [3] SPACE/TIME TRADE/OFFS IN HASH CODING WITH ALLOWABLE ERRORS
    BLOOM, BH
    [J]. COMMUNICATIONS OF THE ACM, 1970, 13 (07) : 422 - &
  • [4] Broder Andrei, 2002, Internet mathematics, P636, DOI DOI 10.1080/15427951.2004.10129096
  • [5] Brown C.T., 2012, What does trinity's in silico normalization do?
  • [6] Informed and automated k-mer size selection for genome assembly
    Chikhi, Rayan
    Medvedev, Paul
    [J]. BIOINFORMATICS, 2014, 30 (01) : 31 - 37
  • [7] Efficient de novo assembly of single-cell bacterial genomes from short-read data sets
    Chitsaz, Hamidreza
    Yee-Greenbaum, Joyclyn L.
    Tesler, Glenn
    Lombardo, Mary-Jane
    Dupont, Christopher L.
    Badger, Jonathan H.
    Novotny, Mark
    Rusch, Douglas B.
    Fraser, Louise J.
    Gormley, Niall A.
    Schulz-Trieglaff, Ole
    Smith, Geoffrey P.
    Evers, Dirk J.
    Pevzner, Pavel A.
    Lasken, Roger S.
    [J]. NATURE BIOTECHNOLOGY, 2011, 29 (10) : 915 - U214
  • [8] Cohen S., 2003, P ACM INT C MAN DAT, P241, DOI [10.1145/872757.872787, DOI 10.1145/872757.872787]
  • [9] Succinct data structures for assembling large genomes
    Conway, Thomas C.
    Bromage, Andrew J.
    [J]. BIOINFORMATICS, 2011, 27 (04) : 479 - 486
  • [10] An improved data stream summary: the count-min sketch and its applications
    Cormode, G
    Muthukrishnan, S
    [J]. JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC, 2005, 55 (01): : 58 - 75