Parallel data intensive applications using MapReduce: a data mining case study in biomedical sciences

Cited by: 0

Authors:

Liangxiu Han

Hwee Yong Ong

Affiliations:

[1] Manchester Metropolitan University, School of Computing, Mathematics and Digital Technology

[2] University of Edinburgh, School of Informatics

Source:

Cluster Computing | 2015 / Vol. 18

Keywords:

Data-intensive computing; Parallel processing; MapReduce; Cloud computing; Data mining application in biomedical science

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Performance is an open issue in data-intensive applications (e.g. data mining tasks). Parallel and distributed computing systems (e.g. multicore computing, grid computing, cloud computing, etc.), along with hybrid programming models (e.g. MapReduce, MPI, etc.), are seen as a sought-after solution for accelerating data-intensive applications. One of the main challenges is how to exploit these advanced technologies effectively to facilitate fundamental scientific discoveries, such as those in the Biomedical Sciences. This paper explores how MapReduce and Cloud computing can accelerate the performance of data-intensive applications through a real data mining use case in the Biomedical Sciences. We first adapted the data mining task to the MapReduce model and then deployed it onto the Cloud. We then built an analytic model based on the MapReduce computations to evaluate the efficiency and performance of the prototype. The results, from both the experiments and the evaluation model, show that performance and scalability can be enhanced through these advanced technologies.

Pages: 403-418

Page count: 15
