Integrated Systems for NGS Data Management and Analysis: Open Issues and Available Solutions

Cited by: 33
Authors
Bianchi, Valerio [1, 3, 4]
Ceol, Arnaud [1]
Ogier, Alessandro G. E. [2]
de Pretis, Stefano [1]
Galeota, Eugenia [1]
Kishore, Kamal [1]
Bora, Pranami [1]
Croci, Ottavio [1]
Campaner, Stefano [1]
Amati, Bruno [1, 2]
Morelli, Marco J. [1]
Pelizzola, Mattia [1]
Affiliations
[1] Fdn Ist Italiano Tecnol, Ctr Genom Sci, IIT SEMM, Milan, Italy
[2] European Inst Oncol, Dept Expt Oncol, Milan, Italy
[3] Hubrecht Inst KNAW, Uppsalalaan 8, NL-3584 CT Utrecht, Netherlands
[4] Univ Med Ctr, Uppsalalaan 8, NL-3584 CT Utrecht, Netherlands
Keywords
MICROARRAY; SOFTWARE; BIOINFORMATICS; FRAMEWORK; TAVERNA; BIOLOGY; SUITE; TOOL; RNA;
DOI
10.3389/fgene.2016.00075
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Next-generation sequencing (NGS) technologies have deeply changed our understanding of cellular processes by delivering an astonishing amount of data at affordable prices; nowadays, many biology laboratories have already accumulated a large number of sequenced samples. However, managing and analyzing these data poses new challenges, which may easily be underestimated by research groups lacking IT and quantitative skills. In this perspective, we identify five issues that should be carefully addressed by research groups approaching NGS technologies: (1) adopting a laboratory information management system (LIMS) and safeguarding the resulting raw data structure in downstream analyses; (2) monitoring the flow of the data and standardizing input and output directories and file names, even when multiple analysis protocols are used on the same data; (3) ensuring complete traceability of the analyses performed; (4) enabling inexperienced users to run analyses through a graphical user interface (GUI) acting as a front-end for the pipelines; (5) relying on standard metadata to annotate the datasets and, when possible, using controlled vocabularies, ideally derived from biomedical ontologies. Finally, we discuss the currently available tools in light of these issues, and we introduce HTS-flow, a new workflow management system conceived to address the concerns we raised. HTS-flow retrieves information from a LIMS database, manages data analyses through a simple GUI, outputs data in standard locations, and allows complete traceability of datasets, accompanying metadata, and analysis scripts.
Pages: 8