Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

Cited by: 0

Authors:

Bongo, Lars Ailo [1,2]

Pedersen, Edvard [1,2]

Ernstsen, Martin

Affiliations:

[1] Univ Tromso, Dept Comp Sci, Tromso, Norway

[2] Univ Tromso, Ctr Bioinformat, Tromso, Norway

Source:

COMPUTATIONAL INTELLIGENCE METHODS FOR BIOINFORMATICS AND BIOSTATISTICS, CIBB 2014 | 2015 / Vol. 8623

Keywords:

data-intensive computing; biological data analysis; flexible pipelines; infrastructure systems; GALAXY

DOI:

10.1007/978-3-319-24462-4_22

Chinese Library Classification number:

Q5 [Biochemistry]

Subject classification codes:

071010; 081704

Abstract:

Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems well suited for processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems. We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged these for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update a large-scale compendium; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.

Citation

Pages: 259-272

Page count: 14

