Fluχ: a quality-driven dataflow model for data intensive computing

Cited by: 4
Authors
Esteves, Sergio [1]
Silva, Joao Nuno [1]
Veiga, Luis [1]
Affiliations
[1] Univ Tecn Lisboa, Inst Super Tecn, INESC ID Lisboa Distributed Syst Grp, Rua Alves Redol 9, P-1000029 Lisbon, Portugal
Keywords
Dataflow; Workflow; Quality-of-Data; Data store; NoSQL
DOI
10.1186/1869-0238-4-12
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Today, there is a growing need for organizations to continuously analyze and process large waves of incoming data from the Internet. Such data processing schemes are often governed by complex dataflow systems, which are deployed atop highly scalable infrastructures that need to manage data efficiently in order to enhance performance and reduce costs. Current workflow management systems enforce strict temporal synchronization among the various processing steps; however, this is not the most desirable behavior in a large number of scenarios. For example, for dataflows that continuously analyze data upon the insertion or update of entries in a data store, it would be wise to assess the level of modification in the data before triggering the dataflow. Doing so would minimize the number of executions (processing steps), reducing overhead and improving performance, while keeping the dataflow processing results within certain coverage and freshness limits. Towards this end, we introduce the notion of Quality-of-Data (QoD), which describes the level of modifications necessary on a data store to trigger processing steps, and thus conveys the level of performance specified through data requirements. This notion can be especially beneficial in cloud computing, where a dataflow computing service (SaaS) may provide certain QoD levels for different budgets. In this article we propose Fluχ, a novel dataflow model, with framework and programming library support, for orchestrating data-based processing steps over a NoSQL data store, whose triggering is based on the evaluation and dynamic enforcement of QoD constraints that are defined (and possibly adjusted automatically) for different sets of data. With Fluχ we demonstrate how dataflows can be leveraged to respond to quality boundaries that bring controlled and augmented performance, rationalization of resources, and task prioritization.
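The QoD idea described in the abstract, deferring a processing step until enough of the data store has changed, can be illustrated with a minimal sketch. This is not the framework's API: the class `QoDStep`, its `threshold` parameter (read here as the fraction of distinct keys modified since the last run), and the plain dictionary standing in for a NoSQL store are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class QoDStep:
    """Hypothetical processing step gated by a QoD bound.

    `threshold` is an assumed bound: the fraction of distinct keys that
    must have been modified since the last run before the step fires.
    """
    action: Callable[[Dict[str, int]], None]
    threshold: float
    dirty: set = field(default_factory=set)
    runs: int = 0

    def on_write(self, store: Dict[str, int], key: str, value: int) -> bool:
        """Apply a write; run the step if the QoD bound is reached."""
        if store.get(key) != value:
            store[key] = value
            self.dirty.add(key)  # repeated writes to one key count once
        if store and len(self.dirty) / len(store) >= self.threshold:
            self.action(store)
            self.dirty.clear()   # start a new accumulation window
            self.runs += 1
            return True
        return False


# A dict stands in for the data store (assumption for this sketch).
store = {"a": 0, "b": 0, "c": 0, "d": 0}
seen = []
step = QoDStep(action=lambda s: seen.append(sum(s.values())), threshold=0.5)

writes = [("a", 1), ("a", 2), ("a", 2), ("b", 1)]
fired = [step.on_write(store, k, v) for k, v in writes]
```

In this run the step executes once for four writes, on the write that brings the modified fraction to one half. A strictly synchronized workflow would have executed on every write.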
Pages: 1 / 23
Page count: 23