An improved partitioning mechanism for optimizing massive data analysis using MapReduce

Cited by: 0

Authors:

Kenn Slagter

Ching-Hsien Hsu

Yeh-Ching Chung

Daqiang Zhang

Affiliations:

[1] National Tsing Hua University,Department of Computer Science

[2] Chung Hua University,Department of Computer Science

[3] Tongji University,School of Software Engineering

Source:

The Journal of Supercomputing | 2013 / Vol. 66

Keywords:

TeraSort; MapReduce; Load balance; Partitioning; Sampling; Cloud computing; Hadoop;

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

In the era of Big Data, huge amounts of structured and unstructured data are being produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. In order for MapReduce to work, it has to divide the workload among computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid the problems inherent in data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the samples are analyzed by the partitioning mechanism. This paper proposes an improved partitioning algorithm that improves load balancing and memory consumption. This is done via an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.
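
The role of sampling described in the abstract can be illustrated with a minimal sketch of sample-based range partitioning, the general idea behind TeraSort-style partitioners. This is not the algorithm proposed in the paper: the function names (`choose_split_points`, `partition`, `reducer_loads`), the integer keys, and the Pareto-distributed test data are all assumptions made for illustration.

```python
import bisect
import random
from collections import Counter


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 keys that cut the sorted sample into equal-sized runs."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def partition(key, split_points):
    """Return the reducer index for a key; a key equal to a split point goes right."""
    return bisect.bisect_right(split_points, key)


def reducer_loads(keys, split_points, num_reducers):
    """Count how many keys each reducer would receive."""
    counts = Counter(partition(k, split_points) for k in keys)
    return [counts.get(i, 0) for i in range(num_reducers)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical skewed key set: most keys are small, a few are very large.
    keys = [int(rng.paretovariate(1.5) * 100) for _ in range(100_000)]
    sample = rng.sample(keys, 1_000)
    splits = choose_split_points(sample, 4)
    print("split points:", splits)
    print("keys per reducer:", reducer_loads(keys, splits, 4))
```

Because the split points are taken from quantiles of the sample rather than from evenly spaced key values, dense key ranges are cut more finely. How balanced the resulting loads are depends on the sample size and on how representative the sample is, which is the dependency the abstract points out.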

Citation

Pages: 539 - 555

Page count: 16

50 items in total

[1] An improved partitioning mechanism for optimizing massive data analysis using MapReduce
Slagter, Kenn
Hsu, Ching-Hsien
Chung, Yeh-Ching
Zhang, Daqiang
JOURNAL OF SUPERCOMPUTING, 2013, 66 (01) : 539 - 555
[2] A Micropartitioning Technique for Massive Data Analysis Using MapReduce
Mohanapriya, S.
Natesan, P.
2014 INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND EMBEDDED SYSTEMS (ICICES), 2014,
[3] Analysis of Massive Industrial Data using MapReduce Framework for Parallel Processing
Aly, Mohab
Yacout, Soumaya
Shaban, Yasser
2017 ANNUAL RELIABILITY AND MAINTAINABILITY SYMPOSIUM, 2017,
[4] Massive Image Data Management using HBase and MapReduce
Liu, Yuehu
Chen, Bin
He, Wenxi
Fang, Yu
2013 21ST INTERNATIONAL CONFERENCE ON GEOINFORMATICS (GEOINFORMATICS), 2013,
[5] A Balanced Partitioning Mechanism Using Collapsed-Condensed Trie in MapReduce
Chen, Hsing-Lung
Chen, Syu-Huan
2018 IEEE 8TH INTERNATIONAL SYMPOSIUM ON CLOUD AND SERVICE COMPUTING (SC2), 2018, : 97 - 102
[6] Locality Based Data Partitioning in MapReduce
Wang, Chunguang
Wu, Qingbo
Tan, Yusong
Wang, Wenzhu
Wu, Quanyuan
2013 IEEE 16TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND ENGINEERING (CSE 2013), 2013, : 1310 - 1317
[7] An Iterative Data Partitioning Strategy for MapReduce
Zhang Y.-M.
Jiang J.-B.
Lu J.-W.
Xu J.
Xiao G.
Jisuanji Xuebao/Chinese Journal of Computers, 2019, 42 (08) : 1873 - 1885
[8] Set similarity join on massive probabilistic data using MapReduce
Ma, Youzhong
Meng, Xiaofeng
DISTRIBUTED AND PARALLEL DATABASES, 2014, 32 (03) : 447 - 464
[9] Optimizing Cloud MapReduce for Processing Stream Data using Pipelining
Karve, Rutvik
Dahiphale, Devendra
Chhajer, Amit
UKSIM FIFTH EUROPEAN MODELLING SYMPOSIUM ON COMPUTER MODELLING AND SIMULATION (EMS 2011), 2011, : 344 - 349
