Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

Cited by: 0
Authors
Bongo, Lars Ailo [1, 2]
Pedersen, Edvard [1, 2]
Ernstsen, Martin
Affiliations
[1] Univ Tromso, Dept Comp Sci, Tromso, Norway
[2] Univ Tromso, Ctr Bioinformat, Tromso, Norway
Source
COMPUTATIONAL INTELLIGENCE METHODS FOR BIOINFORMATICS AND BIOSTATISTICS, CIBB 2014 | 2015, Vol. 8623
Keywords
data-intensive computing; biological data analysis; flexible pipelines; infrastructure systems; GALAXY
DOI
10.1007/978-3-319-24462-4_22
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems well suited for processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems. We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged these for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update large-scale compendia; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.
Pages: 259-272
Page count: 14