Robust Cross-Platform Workflows: How Technical and Scientific Communities Collaborate to Develop, Test and Share Best Practices for Data Analysis

Cited by: 15
Authors
Möller S. [1]
Prescott S.W. [2]
Wirzenius L. [3]
Reinholdtsen P. [4]
Chapman B. [5]
Prins P. [6]
Soiland-Reyes S. [7, 8]
Klötzl F. [9]
Bagnacani A. [10]
Kalaš M. [11]
Tille A.
Crusoe M.R.
Affiliations
[1] Rostock University Medical Center, Institute for Biostatistics and Informatics in Medicine and Ageing Research, Rostock
[2] School of Chemical Engineering, UNSW, Sydney, 2052, NSW
[3] QvarnLabs, Helsinki
[4] University Center for Information Technology, University of Oslo, Oslo
[5] Harvard School of Public Health, Boston, MA
[6] University Medical Center Utrecht, Utrecht
[7] eScience Lab, School of Computer Science, The University of Manchester, Manchester
[8] Apache Software Foundation, Forest Hill, MD
[9] Max-Planck-Institute for Evolutionary Biology, Plön
[10] Department of Systems Biology and Bioinformatics, University of Rostock, Rostock
[11] Computational Biology Unit, Department of Informatics, University of Bergen, Bergen
Funding
European Union Horizon 2020
Keywords
Automated installation; Common workflow language; Container; Continuous integration testing; Software distribution
DOI
10.1007/s41019-017-0050-4
Abstract
Information integration and workflow technologies for data analysis have always been major fields of investigation in bioinformatics. A range of popular workflow suites are available to support analyses in computational biology. Commercial providers tend to offer prepared applications remote to their clients. However, for most academic environments with local expertise, novel data collection techniques or novel data analysis, it is essential to have all the flexibility of open-source tools and open-source workflow descriptions. Workflows in data-driven science such as computational biology have considerably gained in complexity. New tools or new releases with additional features arrive at an enormous pace, and new reference data or concepts for quality control are emerging. A well-abstracted workflow and the exchange of the same across work groups have an enormous impact on the efficiency of research and the further development of the field. High-throughput sequencing adds to the avalanche of data available in the field; efficient computation and, in particular, parallel execution motivate the transition from traditional scripts and Makefiles to workflows. We here review the extant software development and distribution model with a focus on the role of integration testing and discuss the effect of common workflow language on distributions of open-source scientific software to swiftly and reliably provide the tools demanded for the execution of such formally described workflows. It is contended that, alleviated from technical differences for the execution on local machines, clusters or the cloud, communities also gain the technical means to test workflow-driven interaction across several software packages. © 2017, The Author(s).
Pages: 232-244
Number of pages: 12