MPI-blastn and NCBI-TaxCollector: Improving metagenomic analysis with high performance classification and wide taxonomic attachment

Cited by: 7
Authors
Dias, R. [1]
Xavier, M. G. [2]
Rossi, F. D. [2]
Neves, M. V. [2]
Lange, T. A. P. [2]
Giongo, A. [3]
De Rose, C. A. F. [2]
Triplett, E. W. [1]
Affiliations
[1] Univ Florida, Dept Microbiol & Cell Sci, Gainesville, FL 32611 USA
[2] Pontifical Catholic Univ Rio Grande do Sul, Fac Informat, Porto Alegre, RS, Brazil
[3] Pontifical Catholic Univ Rio Grande do Sul, Fac Biosci, Porto Alegre, RS, Brazil
Keywords
BLAST; TaxCollector; NCBI; sequence alignment; metagenomics; taxonomy assignment; taxonomic attachment; PROJECT RDP-II; DATABASE
DOI
10.1142/S0219720014500139
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Metagenomic sequencing technologies are advancing rapidly, and the volume of output data from high-throughput genetic sequencing has increased substantially over the years. As a result, metagenomic analysis now requires advanced computational optimizations. In this paper, we describe a new parallel implementation of nucleotide BLAST (MPI-blastn) and a new tool for taxonomic attachment of Basic Local Alignment Search Tool (BLAST) results that supports the NCBI taxonomy (NCBI-TaxCollector). MPI-blastn achieved high performance compared with mpiBLAST and ScalaBLAST. In our best case, MPI-blastn ran 408 times faster on 384 cores. Our evaluations demonstrated that NCBI-TaxCollector performs taxonomic attachments 125 times faster and needs 120 times less RAM than the previous TaxCollector. With our optimizations, a multiple sequence search that currently takes 37 hours can be performed in less than 6 min, and post-processing with NCBI taxonomic data attachment, which takes 48 hours, can now run in 23 min.
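The runtime figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is only an illustration: the helper names `speedup` and `parallel_efficiency` are hypothetical and do not come from the paper, and "less than 6 min" is treated as an upper bound on the new runtime, so the search ratio it gives is a lower bound.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """Ratio of the old runtime to the new runtime (hypothetical helper)."""
    return before_minutes / after_minutes


def parallel_efficiency(speedup_factor: float, cores: int) -> float:
    """Speedup divided by core count; values above 1.0 are superlinear."""
    return speedup_factor / cores


# Sequence search: 37 hours down to "less than 6 min" (6 min taken as a bound).
search_lower_bound = speedup(37 * 60, 6)

# Taxonomic attachment: 48 hours down to 23 min.
attachment = speedup(48 * 60, 23)

# Best case reported in the abstract: 408x on 384 cores.
efficiency = parallel_efficiency(408, 384)

# Runtime a 37-hour search would have at exactly 408x.
search_minutes_at_408x = 37 * 60 / 408

print(f"search speedup (lower bound): {search_lower_bound:.0f}x")
print(f"attachment speedup: {attachment:.1f}x")
print(f"parallel efficiency: {efficiency:.4f}")
print(f"37 h at 408x: {search_minutes_at_408x:.2f} min")
```

Under these assumptions the quoted numbers are mutually consistent: 48 hours against 23 min gives roughly the stated 125-fold gain, and a 408-fold speedup of a 37-hour search lands under 6 min. An efficiency above 1.0 means the reported best case is superlinear in the number of cores.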
Pages: 17