Starcode: sequence clustering based on all-pairs search

Cited by: 119
Authors
Zorita, Eduard [1 ,2 ]
Cusco, Pol [1 ,2 ]
Filion, Guillaume J. [1 ,2 ]
Affiliations
[1] Ctr Genom Regulat CRG, Genome Architecture Gene Regulat Stem Cells & Can, Barcelona 08003, Spain
[2] UPF, Barcelona 08002, Spain
Keywords
PROTEINS; READS;
DOI
10.1093/bioinformatics/btv053
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA barcodes. In this case, error correction can be done by performing a pairwise comparison of all the barcodes, which is a computationally complex problem.
Results: Here, we address this challenge and describe an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman-Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision.
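To make the all-pairs formulation in the abstract concrete, the following is a minimal sketch of the naive approach: compare every pair of barcodes with a bounded Levenshtein check, then merge the matched pairs into clusters. It is an illustration under stated assumptions, not starcode's implementation. It does not use the poucet search or a trie, it runs in quadratic time, and the merging step uses connected components (union-find), which is an assumed stand-in for whatever merging rule the paper applies. The function names `levenshtein_within` and `cluster_sequences` are hypothetical.

```python
from itertools import combinations


def levenshtein_within(a: str, b: str, max_dist: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= max_dist."""
    if abs(len(a) - len(b)) > max_dist:
        return False  # the length difference alone already exceeds the threshold
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        if min(curr) > max_dist:
            return False  # the row minimum is a lower bound on the final distance
        prev = curr
    return prev[-1] <= max_dist


def cluster_sequences(seqs, max_dist):
    """Naive all-pairs clustering (assumption: clusters are connected components)."""
    parent = list(range(len(seqs)))

    def find(x):
        # Find the representative of x, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair once and merge the pairs that match.
    for i, j in combinations(range(len(seqs)), 2):
        if levenshtein_within(seqs[i], seqs[j], max_dist):
            parent[find(i)] = find(j)

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    # Made-up barcodes for illustration only.
    barcodes = ["ACGTACGT", "ACGTACGA", "ACGTACG", "TTTTGGGG", "TTTTGGGC"]
    for cluster in cluster_sequences(barcodes, max_dist=1):
        print(cluster)
```

The quadratic pair loop is exactly the cost that the abstract calls computationally complex; the sketch only shows what is being computed, not how starcode computes it efficiently.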
Pages: 1913-1919
Number of pages: 7
Related papers
18 in total
[1]   Chromatin Position Effects Assayed by Thousands of Reporters Integrated in Parallel [J].
Akhtar, Waseem ;
de Jong, Johann ;
Pindyurin, Alexey V. ;
Pagie, Ludo ;
Meuleman, Wouter ;
de Ridder, Jeroen ;
Berns, Anton ;
Wessels, Lodewyk F. A. ;
van Lohuizen, Maarten ;
van Steensel, Bas .
CELL, 2013, 154 (04) :914-927
[2]   SEED: efficient clustering of next-generation sequences [J].
Bao, Ergude ;
Jiang, Tao ;
Kaloshian, Isgouhi ;
Girke, Thomas .
BIOINFORMATICS, 2011, 27 (18) :2502-2509
[3]   Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads [J].
Chong, Zechen ;
Ruan, Jue ;
Wu, Chung-I. .
BIOINFORMATICS, 2012, 28 (21) :2732-2737
[4]   RRM-RNA recognition: NMR or crystallography ... and new findings [J].
Daubner, Gerrit M. ;
Clery, Antoine ;
Allain, Frederic H-T .
CURRENT OPINION IN STRUCTURAL BIOLOGY, 2013, 23 (01) :100-108
[5]   Substantial biases in ultra-short read data sets from high-throughput DNA sequencing [J].
Dohm, Juliane C. ;
Lottaz, Claudio ;
Borodina, Tatiana ;
Himmelbauer, Heinz .
NUCLEIC ACIDS RESEARCH, 2008, 36 (16)
[6]   Real-Time DNA Sequencing from Single Polymerase Molecules [J].
Eid, John ;
Fehr, Adrian ;
Gray, Jeremy ;
Luong, Khai ;
Lyle, John ;
Otto, Geoff ;
Peluso, Paul ;
Rank, David ;
Baybayan, Primo ;
Bettman, Brad ;
Bibillo, Arkadiusz ;
Bjornson, Keith ;
Chaudhuri, Bidhan ;
Christians, Frederick ;
Cicero, Ronald ;
Clark, Sonya ;
Dalal, Ravindra ;
deWinter, Alex ;
Dixon, John ;
Foquet, Mathieu ;
Gaertner, Alfred ;
Hardenbol, Paul ;
Heiner, Cheryl ;
Hester, Kevin ;
Holden, David ;
Kearns, Gregory ;
Kong, Xiangxu ;
Kuse, Ronald ;
Lacroix, Yves ;
Lin, Steven ;
Lundquist, Paul ;
Ma, Congcong ;
Marks, Patrick ;
Maxham, Mark ;
Murphy, Devon ;
Park, Insil ;
Pham, Thang ;
Phillips, Michael ;
Roy, Joy ;
Sebra, Robert ;
Shen, Gene ;
Sorenson, Jon ;
Tomaney, Austin ;
Travers, Kevin ;
Trulson, Mark ;
Vieceli, John ;
Wegener, Jeffrey ;
Wu, Dawn ;
Yang, Alicia ;
Zaccarin, Denis .
SCIENCE, 2009, 323 (5910) :133-138
[7]   CD-HIT: accelerated for clustering the next-generation sequencing data [J].
Fu, Limin ;
Niu, Beifang ;
Zhu, Zhengwei ;
Wu, Sitao ;
Li, Weizhong .
BIOINFORMATICS, 2012, 28 (23) :3150-3152
[8]   MacKay, David J. C., 2002, Information Theory, Inference, and Learning Algorithms
[9]   Genome sequencing in microfabricated high-density picolitre reactors [J].
Margulies, M ;
Egholm, M ;
Altman, WE ;
Attiya, S ;
Bader, JS ;
Bemben, LA ;
Berka, J ;
Braverman, MS ;
Chen, YJ ;
Chen, ZT ;
Dewell, SB ;
Du, L ;
Fierro, JM ;
Gomes, XV ;
Godwin, BC ;
He, W ;
Helgesen, S ;
Ho, CH ;
Irzyk, GP ;
Jando, SC ;
Alenquer, MLI ;
Jarvie, TP ;
Jirage, KB ;
Kim, JB ;
Knight, JR ;
Lanza, JR ;
Leamon, JH ;
Lefkowitz, SM ;
Lei, M ;
Li, J ;
Lohman, KL ;
Lu, H ;
Makhijani, VB ;
McDade, KE ;
McKenna, MP ;
Myers, EW ;
Nickerson, E ;
Nobile, JR ;
Plant, R ;
Puc, BP ;
Ronan, MT ;
Roth, GT ;
Sarkis, GJ ;
Simons, JF ;
Simpson, JW ;
Srinivasan, M ;
Tartaro, KR ;
Tomasz, A ;
Vogt, KA ;
Volkmer, GA .
NATURE, 2005, 437 (7057) :376-380
[10]   Sequence-specific error profile of Illumina sequencers [J].
Nakamura, Kensuke ;
Oshima, Taku ;
Morimoto, Takuya ;
Ikeda, Shun ;
Yoshikawa, Hirofumi ;
Shiwa, Yuh ;
Ishikawa, Shu ;
Linak, Margaret C. ;
Hirai, Aki ;
Takahashi, Hiroki ;
Altaf-Ul-Amin, Md. ;
Ogasawara, Naotake ;
Kanaya, Shigehiko .
NUCLEIC ACIDS RESEARCH, 2011, 39 (13) :e90