Informed and automated k-mer size selection for genome assembly

Cited by: 530
Authors
Chikhi, Rayan [1]
Medvedev, Paul [1, 2]
Affiliations
[1] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
[2] Penn State Univ, Dept Biochem & Mol Biol, University Pk, PA 16802 USA
Keywords
BACTERIAL GENOMES; SINGLE-CELL
DOI
10.1093/bioinformatics/btt310
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision.
Results: We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies.
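Illustration: the idea of building an approximate k-mer abundance histogram from a sample can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sampled_abundance_histogram`, the parameter `eps`, and the use of CRC32 as the sampling hash are all hypothetical choices made for illustration. A k-mer is kept only when its hash is divisible by `eps`, so roughly one in `eps` distinct k-mers is counted, and each histogram bin is scaled by `eps` to estimate the full histogram. Reverse complements are not merged here.

```python
import zlib
from collections import Counter


def sampled_abundance_histogram(reads, k, eps=1):
    """Estimate how many distinct k-mers occur at each abundance.

    Hypothetical helper for illustration only. A k-mer is sampled when
    crc32(kmer) % eps == 0, so all occurrences of a sampled k-mer are
    counted and its abundance is exact; only the set of k-mers is sampled.
    """
    counts = Counter()
    for read in reads:
        # A read shorter than k yields an empty range and is skipped.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if zlib.crc32(kmer.encode()) % eps == 0:
                counts[kmer] += 1

    # Each sampled k-mer stands in for about eps distinct k-mers.
    histogram = Counter()
    for abundance in counts.values():
        histogram[abundance] += eps
    return dict(histogram)


# With eps=1 nothing is discarded, so the histogram is exact.
exact = sampled_abundance_histogram(["ACGTACGT"], k=4, eps=1)
```

With `eps=1` the result is the exact histogram; larger values of `eps` trade accuracy for less counting work and memory. Running the function once per candidate k would give the set of histograms that a k-selection heuristic could then compare.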
Pages: 31-37
Page count: 7