Transcriptome annotation in the cloud: complexity, best practices, and cost

Cited by: 8
Authors
Alvarez, Roberto Vera [1]
Marino-Ramirez, Leonardo [1, 2]
Landsman, David [1]
Affiliations
[1] Natl Ctr Biotechnol Informat, Computat Biol Branch, Natl Lib Med, NIH, 9000 Rockville Pike, Bethesda, MD 20890 USA
[2] Natl Inst Minor Hlth & Hlth Dispar, Div Intramural Res, NIH, 9000 Rockville Pike, Bethesda, MD 20890 USA
Source
GIGASCIENCE | 2021 / Vol. 10 / Issue 02
Funding
National Institutes of Health (USA);
Keywords
DOI
10.1093/gigascience/giaa163
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07; 0710; 09;
Abstract
Background: The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative provides NIH-funded researchers cost-effective access to commercial cloud providers, such as Amazon Web Services (AWS) and Google Cloud Platform (GCP). These cloud providers represent an alternative for the execution of large computational biology experiments like transcriptome annotation, which is a complex analytical process that requires the interrogation of multiple biological databases with several advanced computational tools. The core components of annotation pipelines published since 2012 are BLAST sequence alignments using annotated databases of both nucleotide or protein sequences almost exclusively with networked on-premises compute systems.
Findings: We compare multiple BLAST sequence alignments using AWS and GCP. We prepared several Jupyter Notebooks with all the code required to submit computing jobs to the batch system on each cloud provider. We consider the consequence of the number of query transcripts in input files and the effect on cost and processing time. We tested compute instances with 16, 32, and 64 vCPUs on each cloud provider. Four classes of timing results were collected: the total run time, the time for transferring the BLAST databases to the instance local solid-state disk drive, the time to execute the CWL script, and the time for the creation, set-up, and release of an instance. This study aims to establish an estimate of the cost and compute time needed for the execution of multiple BLAST runs in a cloud environment.
Conclusions: We demonstrate that public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost. Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with ~500,000 transcripts can be processed in <2 hours with a compute cost of ~$200-$250.
In our opinion, for BLAST-based workflows, the choice of cloud platform is not dependent on the workflow but, rather, on the specific details and requirements of the cloud provider. These choices include the accessibility for institutional use, the technical knowledge required for effective use of the platform services, and the availability of open source frameworks such as APIs to deploy the workflow.
Pages: 11
Related papers
39 in total
  • [1] Comprehensive Stress-Based De Novo Transcriptome Assembly and Annotation of Guar (Cyamopsis tetragonoloba (L.) Taub.): An Important Industrial and Forage Crop
    Al-Qurainy, Fahad
    Alshameri, Aref
    Gaafar, Abdel-Rhman
    Khan, Salim
    Nadeem, Mohammad
    Alameri, Abdulhafed Abdullah
    Tarroum, Mohamed
    Ashraf, Muhammad
    [J]. INTERNATIONAL JOURNAL OF GENOMICS, 2019, 2019
  • [2] BASIC LOCAL ALIGNMENT SEARCH TOOL
    ALTSCHUL, SF
    GISH, W
    MILLER, W
    MYERS, EW
    LIPMAN, DJ
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) : 403 - 410
  • [3] Workflow and web application for annotating NCBI BioProject transcriptome data
    Alvarez, Roberto Vera
    Vidal, Newton Medeiros
    Garzon-Martinez, Gina A.
    Barrero, Luz S.
    Landsman, David
    Marino-Ramirez, Leonardo
    [J]. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2017,
  • [4] [Anonymous], **DROPPED REF**
  • [5] [Anonymous], Common Workflow Language User Guide - Common Workflow Language User Guide 0.1 documentation, DOI 10.1109/MCSE.2019.2906593
  • [6] [Anonymous], CLOUD LIFE SCI
  • [7] [Anonymous], CONTAINER REGISTRY
  • [8] [Anonymous], SRA in the cloud
  • [9] [Anonymous], Genome information by organism.
  • [10] Gene Ontology: tool for the unification of biology
    Ashburner, M
    Ball, CA
    Blake, JA
    Botstein, D
    Butler, H
    Cherry, JM
    Davis, AP
    Dolinski, K
    Dwight, SS
    Eppig, JT
    Harris, MA
    Hill, DP
    Issel-Tarver, L
    Kasarskis, A
    Lewis, S
    Matese, JC
    Richardson, JE
    Ringwald, M
    Rubin, GM
    Sherlock, G
    [J]. NATURE GENETICS, 2000, 25 (01) : 25 - 29