ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

被引:17
作者
Cai, Yunpeng [1 ]
Zheng, Wei [2 ]
Yao, Jin [3 ]
Yang, Yujie [1 ]
Mai, Volker [4 ]
Mao, Qi [3 ]
Sun, Yijun [2 ,3 ,5 ]
机构
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
[2] SUNY Buffalo, Dept Comp Sci & Engn, Buffalo, NY USA
[3] SUNY Buffalo, Dept Microbiol & Immunol, Buffalo, NY USA
[4] Univ Florida, Dept Epidemiol, Gainesville, FL USA
[5] SUNY Buffalo, Dept Biostat, Buffalo, NY USA
基金
美国国家科学基金会; 美国国家卫生研究院;
关键词
LARGE SETS; 16S RDNA; DEEP-SEA; ALIGNMENT; DIVERSITY; CLASSIFICATION; ALGORITHM; PROTEIN; TREES;
D O I
10.1371/journal.pcbi.1005518
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence dataset. Our experiment demonstrated that with the power of parallel computing it is now compu-tationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/similar to yijunsun/lab/ESPRIT-Forest.html.
引用
收藏
页数:16
相关论文
共 51 条
[1]  
[Anonymous], 2008, Algorithm Design Manual
[2]  
[Anonymous], SCIENCE
[3]   Estimation of bacterial diversity using next generation sequencing of 16S rDNA: a comparison of different workflows [J].
Barriuso, Jorge ;
Valverde, Jose R. ;
Mellado, Rafael P. .
BMC BIOINFORMATICS, 2011, 12
[4]   Ultra-deep sequencing for the analysis of viral populations [J].
Beerenwinkel, Niko ;
Zagordi, Osvaldo .
CURRENT OPINION IN VIROLOGY, 2011, 1 (05) :413-418
[5]   Comparing clustering and pre-processing in taxonomy analysis [J].
Bonder, Marc J. ;
Abeln, Sanne ;
Zaura, Egija ;
Brandt, Bernd W. .
BIOINFORMATICS, 2012, 28 (22) :2891-2897
[6]   Measurement and Clinical Monitoring of Human Lymphocyte Clonality by Massively Parallel V-D-J Pyrosequencing [J].
Boyd, Scott D. ;
Marshall, Eleanor L. ;
Merker, Jason D. ;
Maniar, Jay M. ;
Zhang, Lyndon N. ;
Sahaf, Bita ;
Jones, Carol D. ;
Simen, Birgitte B. ;
Hanczaruk, Bozena ;
Nguyen, Khoa D. ;
Nadeau, Kari C. ;
Egholm, Michael ;
Miklos, David B. ;
Zehnder, James L. ;
Fire, Andrew Z. .
SCIENCE TRANSLATIONAL MEDICINE, 2009, 1 (12)
[7]   ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time [J].
Cai, Yunpeng ;
Sun, Yijun .
NUCLEIC ACIDS RESEARCH, 2011, 39 (14) :e95
[8]   MSClust: A Multi-Seeds based Clustering algorithm for microbiome profiling using 16S rRNA sequence [J].
Chen, Wei ;
Cheng, Yongmei ;
Zhang, Clarence ;
Zhang, Shaowu ;
Zhao, Hongyu .
JOURNAL OF MICROBIOLOGICAL METHODS, 2013, 94 (03) :347-355
[9]   Composition, variability, and temporal stability of the intestinal microbiota of the elderly [J].
Claesson, Marcus J. ;
Cusack, Siobhan ;
O'Sullivan, Orla ;
Greene-Diniz, Rachel ;
de Weerd, Heleen ;
Flannery, Edel ;
Marchesi, Julian R. ;
Falush, Daniel ;
Dinan, Timothy ;
Fitzgerald, Gerald ;
Stanton, Catherine ;
van Sinderen, Douwe ;
O'Connor, Michael ;
Harnedy, Norma ;
O'Connor, Kieran ;
Henry, Colm ;
O'Mahony, Denis ;
Fitzgerald, Anthony P. ;
Shanahan, Fergus ;
Twomey, Cillian ;
Hill, Colin ;
Ross, R. Paul ;
O'Toole, Paul W. .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2011, 108 :4586-4591
[10]   The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis [J].
Cole, JR ;
Chai, B ;
Farris, RJ ;
Wang, Q ;
Kulam, SA ;
McGarrell, DM ;
Garrity, GM ;
Tiedje, JM .
NUCLEIC ACIDS RESEARCH, 2005, 33 :D294-D296