Container-based bioinformatics with Pachyderm

Cited by: 26
Authors
Novella, Jon Ander [1, 2]
Emami Khoonsari, Payam [3]
Herman, Stephanie [1, 2, 3]
Whitenack, Daniel [4]
Capuccini, Marco [1, 2, 5]
Burman, Joachim [6]
Kultima, Kim [3]
Spjuth, Ola [1, 2]
Affiliations
[1] Uppsala Univ, Dept Pharmaceut Biosci, S-75214 Uppsala, Sweden
[2] Uppsala Univ, Sci Life Lab, S-75214 Uppsala, Sweden
[3] Uppsala Univ, Dept Med Sci, Clin Chem, S-75185 Uppsala, Sweden
[4] Pachyderm Inc, San Francisco, CA 94107 USA
[5] Uppsala Univ, Dept Informat Technol, S-75105 Uppsala, Sweden
[6] Uppsala Univ, Dept Neurosci, S-75185 Uppsala, Sweden
Funding
Swedish Research Council; European Union Horizon 2020
Keywords
MASS-SPECTROMETRY;
DOI
10.1093/bioinformatics/bty699
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages and (iii) a data management layer that tracks data as it moves through the processing pipeline.
Results: Pachyderm is an open-source workflow system and data management framework that fulfils these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, having Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created. The benefits of Pachyderm are illustrated via a large metabolomics workflow, demonstrating that Pachyderm enables efficient and sustainable data science workflows while maintaining reproducibility and scalability.
Pages: 839-846
Page count: 8