The application of Hadoop in structural bioinformatics

Cited by: 8
Authors
Alnasir, Jamie J. [1]
Shanahan, Hugh P. [2]
Affiliations
[1] Inst Canc Res, Sci Comp, London, England
[2] Royal Holloway Univ London, Dept Comp Sci, Egham, Surrey, England
Keywords
structural bioinformatics; Hadoop; Spark; cloud computing; protein-protein interactions; structure alignment; binding sites; molecular docking; MapReduce; program; prediction; framework; database; server
DOI
10.1093/bib/bby106
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
This paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework for analysing large fractions of the Protein Data Bank, which is key for high-throughput studies of, for example, protein-ligand docking, clustering of protein-ligand complexes and structural alignment. Specifically, we review a number of implementations in the literature that use Hadoop for high-throughput analyses, and their scalability. We find that these deployments for the most part call known executables from MapReduce rather than rewriting the algorithms. Their scalability is variable in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. Although direct comparisons of Hadoop with batch schedulers are absent from the literature, there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and of configuring a resource to use Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases and standardised approaches such as workflow languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
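The pattern noted in the abstract, calling a known executable from MapReduce rather than rewriting the algorithm, can be sketched as a Hadoop Streaming-style mapper. This is a minimal illustration and not code from any of the reviewed deployments. The function names `run_tool` and `map_lines` are hypothetical, and `DEMO_COMMAND` is a stand-in for a real structural bioinformatics executable, used only so the sketch runs anywhere Python does.

```python
import subprocess
import sys


def run_tool(command, structure_id):
    """Run an external executable on one structure ID and return its stdout."""
    result = subprocess.run(
        command + [structure_id],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


def map_lines(lines, command):
    """Mapper: one structure ID per input line -> 'id<TAB>tool output' records."""
    for line in lines:
        structure_id = line.strip()
        if not structure_id:
            continue  # skip blank lines
        yield f"{structure_id}\t{run_tool(command, structure_id)}"


# Hypothetical stand-in for an existing executable: a Python one-liner that
# prints the length of the ID it receives as its first argument.
DEMO_COMMAND = [sys.executable, "-c", "import sys; print(len(sys.argv[1]))"]

if __name__ == "__main__":
    # Under Hadoop Streaming the iterable would be sys.stdin (one input split);
    # a small in-memory list is used here so the sketch is self-contained.
    for record in map_lines(["1abc\n", "2xyz9\n"], DEMO_COMMAND):
        print(record)
```

The mapper emits tab-separated key/value lines, the default record format Hadoop Streaming expects, so the wrapped executable needs no changes.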
Pages: 96-105
Number of pages: 10