DDBJ Read Annotation Pipeline: A Cloud Computing-Based Pipeline for High-Throughput Analysis of Next-Generation Sequencing Data

Cited by: 47
Authors
Nagasaki, Hideki [1, 2]
Mochizuki, Takako [1, 2]
Kodama, Yuichi [1, 2]
Saruhashi, Satoshi [1, 2]
Morizaki, Shota [3]
Sugawara, Hideaki [1, 2]
Ohyanagi, Hajime [4]
Kurata, Nori [4]
Okubo, Kousaku [1, 2, 5]
Takagi, Toshihisa [1, 2, 5]
Kaminuma, Eli [1, 2]
Nakamura, Yasukazu [1, 2]
Affiliations
[1] Natl Inst Genet, Ctr Informat Biol, Mishima, Shizuoka 4118510, Japan
[2] Natl Inst Genet, DNA Data Bank Japan, Mishima, Shizuoka 4118510, Japan
[3] Fujisoft Inc, Chiyoda Ku, Tokyo 1010022, Japan
[4] Natl Inst Genet, Plant Genet Lab, Mishima, Shizuoka 4118510, Japan
[5] Database Ctr Life Sci, Bunkyo Ku, Tokyo 1130032, Japan
Keywords
next-generation sequencing; sequence read archive; cloud computing; analytical pipeline; genome analysis; BURROWS-WHEELER TRANSFORM; RNA-SEQ DATA; GENOME SEQUENCE; ALIGNMENT; ULTRAFAST; ASSEMBLER; VARIANTS; ARCHIVE; BIOLOGY; FORMAT
DOI
10.1093/dnares/dst017
Chinese Library Classification (CLC) number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biological research. However, the immense amount of sequence data requires computational skills and suitable hardware resources that are a challenge to molecular biologists. The DNA Data Bank of Japan (DDBJ) of the National Institute of Genetics (NIG) has initiated a cloud computing-based analytical pipeline, the DDBJ Read Annotation Pipeline (DDBJ Pipeline), for a high-throughput annotation of NGS reads. The DDBJ Pipeline offers a user-friendly graphical web interface and processes massive NGS datasets using decentralized processing by NIG supercomputers currently free of charge. The proposed pipeline consists of two analysis components: basic analysis for reference genome mapping and de novo assembly and subsequent high-level analysis of structural and functional annotations. Users may smoothly switch between the two components in the pipeline, facilitating web-based operations on a supercomputer for high-throughput data analysis. Moreover, public NGS reads of the DDBJ Sequence Read Archive located on the same supercomputer can be imported into the pipeline through the input of only an accession number. This proposed pipeline will facilitate research by utilizing unified analytical workflows applied to the NGS data. The DDBJ Pipeline is accessible at http://p.ddbj.nig.ac.jp/.
Pages: 383-390
Page count: 8