MSClust: A Multi-Seeds based Clustering algorithm for microbiome profiling using 16S rRNA sequence

Cited by: 17
Authors
Chen, Wei [1, 2]
Cheng, Yongmei [1 ]
Zhang, Clarence [3 ]
Zhang, Shaowu [1 ]
Zhao, Hongyu [2 ]
Affiliations
[1] Northwestern Polytech Univ, Coll Automat, Xian 710072, Peoples R China
[2] Yale Univ, Sch Publ Hlth, Dept Biostat, New Haven, CT 06510 USA
[3] Yale Univ, Sch Med, Keck Biotechnol Lab, New Haven, CT 06510 USA
Funding
National Natural Science Foundation of China;
Keywords
Clustering algorithms; Operational taxonomic unit (OTU); Next-generation sequencing; Seeds-selection; 16S rRNA reads; LARGE SETS; ACCURATE; PROGRAM;
DOI
10.1016/j.mimet.2013.07.004
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Recent developments of next generation sequencing technologies have led to rapid accumulation of 16S rRNA sequences for microbiome profiling. One key step in data processing is to cluster short sequences into operational taxonomic units (OTUs). Although many methods have been proposed for OTU inferences, a major challenge is the balance between inference accuracy and computational efficiency, where inference accuracy is often sacrificed to accommodate the need to analyze large numbers of sequences. Inspired by the hierarchical clustering method and a modified greedy network clustering algorithm, we propose a novel multi-seeds based heuristic clustering method, named MSClust, for OTU inference. MSClust first adaptively selects multi-seeds instead of one seed for each candidate cluster, and the reads are then processed using a greedy clustering strategy. Through many numerical examples, we demonstrate that MSClust enjoys less memory usage, and better biological accuracy compared to existing heuristic clustering methods while preserving efficiency and scalability. © 2013 Elsevier B.V. All rights reserved.
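The general idea of greedy clustering with several seeds per cluster can be illustrated with a minimal sketch. This is not the MSClust implementation: the abstract does not specify how seeds are selected adaptively or how similarity is computed, so the position-wise identity measure, the `threshold` and `max_seeds` parameters, and the function names below are all assumptions made for illustration only.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): fraction of matching positions
    between two equal-length reads; 0.0 if lengths differ."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def multi_seed_greedy_cluster(
    reads: List[str], threshold: float = 0.9, max_seeds: int = 3
) -> List[List[str]]:
    """Hypothetical greedy clustering in which each cluster keeps up to
    `max_seeds` representative reads instead of a single seed."""
    clusters: List[Dict[str, List[str]]] = []
    for read in reads:
        best, best_score = None, 0.0
        for cluster in clusters:
            seeds = cluster["seeds"]
            # Score a read against a cluster by its mean similarity to all seeds.
            score = sum(identity(read, s) for s in seeds) / len(seeds)
            if score >= threshold and score > best_score:
                best, best_score = cluster, score
        if best is None:
            # No cluster is close enough: the read seeds a new candidate cluster.
            clusters.append({"seeds": [read], "members": [read]})
        else:
            best["members"].append(read)
            # Simplistic stand-in for adaptive seed selection (assumption):
            # promote early members to seeds until the seed budget is used.
            if len(best["seeds"]) < max_seeds and read not in best["seeds"]:
                best["seeds"].append(read)
    return [c["members"] for c in clusters]


reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "AAAAAAAATT", "CCCCCCCCCG"]
otus = multi_seed_greedy_cluster(reads, threshold=0.8)
```

In this toy run the fourth read is compared against two seeds of the first cluster rather than one, which is the point of keeping multiple seeds: a cluster is represented by several of its members, not only by whichever read happened to arrive first.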
Pages: 347-355
Number of pages: 9