Parallelizing XML data-streaming workflows via MapReduce

Cited by: 15
Authors
Zinn, Daniel [1]
Bowers, Shawn [2, 3]
Koehler, Sven [2]
Ludaescher, Bertram [1, 2]
Affiliations
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Univ Calif Davis, UC Davis Genome Ctr, Davis, CA 95616 USA
[3] Gonzaga Univ, Dept Comp Sci, Spokane, WA 99258 USA
Keywords
MapReduce; XML processing pipelines; Collection-Oriented Modeling and Design (COMAD); Virtual Data Assembly Line (VDAL); Parallelization; Static analysis; Grouping; Data stream processing; SYSTEM;
DOI
10.1016/j.jcss.2009.11.006
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to "black-box" (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows. (C) 2009 Elsevier Inc. All rights reserved.
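As a rough illustration of the general idea in the abstract (one pipeline step calling a black-box function on XML fragments, run data-parallel in a map phase and reassembled in a reduce phase), the following Python sketch may help. It is not one of the paper's compilation strategies and does not use Hadoop; the document layout, the `black_box` function, and all helper names are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical input: a flat collection whose <item> children are independent.
DOC = "<collection><item id='a'>1</item><item id='b'>2</item><item id='c'>3</item></collection>"


def black_box(value: int) -> int:
    """Stand-in for an opaque scientific function (an assumption for this sketch)."""
    return value * value


def split(doc: str):
    """Cut the document into (position, serialized fragment) records."""
    root = ET.fromstring(doc)
    return [(i, ET.tostring(child, encoding="unicode")) for i, child in enumerate(root)]


def map_step(record):
    """Map: update one fragment and emit (group key, (position, fragment))."""
    pos, frag = record
    elem = ET.fromstring(frag)
    elem.text = str(black_box(int(elem.text)))
    return "collection", (pos, ET.tostring(elem, encoding="unicode"))


def reduce_step(key, values):
    """Reduce: reassemble the updated fragments under their parent, in document order."""
    parent = ET.Element(key)
    for _, frag in sorted(values):
        parent.append(ET.fromstring(frag))
    return ET.tostring(parent, encoding="unicode")


def run(doc: str):
    """Sequentially simulate the split, map, group-by-key, and reduce phases."""
    groups = defaultdict(list)
    for key, value in map(map_step, split(doc)):
        groups[key].append(value)
    return [reduce_step(key, values) for key, values in groups.items()]


if __name__ == "__main__":
    print(run(DOC)[0])
```

Each call to `map_step` depends only on its own fragment, so in a real MapReduce deployment the calls could run on different workers; the position carried with each fragment lets the reducer restore the original order.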
Pages: 447-463
Number of pages: 17