A Workflow Management System for Scalable Data Mining on Clouds

Cited by: 51
Authors
Marozzo, Fabrizio [1 ]
Talia, Domenico [1 ]
Trunfio, Paolo [1 ]
Affiliations
[1] Univ Calabria, DIMES, I-87036 Arcavacata Di Rende, CS, Italy
Keywords
Workflows; data analysis; cloud computing; software-as-a-service; scalability; framework; services
DOI
10.1109/TSC.2016.2589243
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The extraction of useful information from data is often a complex process that can be conveniently modeled as a data analysis workflow. When very large data sets must be analyzed or complex data mining algorithms must be executed, data analysis workflows may take a very long time to complete. Efficient systems are therefore required for the scalable execution of data analysis workflows, exploiting the computing services of the Cloud platforms where data is increasingly being stored. The objective of the paper is to demonstrate how Cloud software technologies can be integrated to implement an effective environment for designing and executing scalable data analysis workflows. We describe the design and implementation of the Data Mining Cloud Framework (DMCF), a data analysis system that integrates a visual workflow language and a parallel runtime with the Software-as-a-Service (SaaS) model. DMCF was designed around the needs of real data mining applications, with the goal of simplifying their development compared to generic workflow management systems that are not specifically designed for this domain. The result is a high-level environment that, through an integrated visual workflow language, minimizes programming effort and makes it easier for domain experts to use common patterns specifically designed for the development and parallel execution of data mining applications. DMCF's visual workflow language, system architecture, and runtime mechanisms are presented. We also discuss several data mining workflows developed with DMCF and the scalability obtained by executing these workflows on a public Cloud.
Pages: 480-492
Number of pages: 13