An improved partitioning mechanism for optimizing massive data analysis using MapReduce

Cited by: 0

Authors:

Kenn Slagter

Ching-Hsien Hsu

Yeh-Ching Chung

Daqiang Zhang

Affiliations:

[1] National Tsing Hua University, Department of Computer Science

[2] Chung Hua University, Department of Computer Science

[3] Tongji University, School of Software Engineering

Source:

The Journal of Supercomputing | 2013 / Vol. 66

Keywords:

TeraSort; MapReduce; Load balance; Partitioning; Sampling; Cloud computing; Hadoop

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

In the era of Big Data, huge amounts of structured and unstructured data are being produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. In order for MapReduce to work, it has to divide the workload among computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid the problems caused by data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the samples are analyzed by the partitioning mechanism. This paper proposes an improved partitioning algorithm that improves load balancing and reduces memory consumption. This is done via an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.
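The general idea of sample-based partitioning mentioned in the abstract can be illustrated with a small sketch. This is not the algorithm proposed in the paper, and it is not TeraSort's implementation. It is a generic range partitioner that picks split keys from a sorted sample, and every name in it (`choose_split_points`, `partition_for`, `partition_loads`) is hypothetical. It assumes a non-empty sample of comparable keys.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 split keys at evenly spaced ranks of the sorted sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the key range that `key` falls into."""
    return bisect.bisect_right(split_points, key)


def partition_loads(keys, split_points):
    """Count how many keys each partition would receive."""
    loads = [0] * (len(split_points) + 1)
    for key in keys:
        loads[partition_for(key, split_points)] += 1
    return loads


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed keys: most values are small, a few are very large.
    keys = [int(rng.expovariate(1 / 1000)) for _ in range(100_000)]

    # Split points derived from a small random sample of the keys.
    sample = rng.sample(keys, 1_000)
    sampled_splits = choose_split_points(sample, 4)

    # Baseline: equal-width key ranges that ignore the key distribution.
    widest = max(keys)
    naive_splits = [widest * i // 4 for i in range(1, 4)]

    print("sample-based loads:", partition_loads(keys, sampled_splits))
    print("equal-width loads: ", partition_loads(keys, naive_splits))
```

Running the script prints the per-partition key counts for both strategies, which gives a rough feel for how the size and representativeness of the sample affect load balance on skewed input.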

Citation

Pages: 539 - 555

Page count: 16

50 entries in total

[21] Parallel Processing of Massive EEG Data with MapReduce
Wang, Lizhe
Chen, Dan
Ranjan, Rajiv
Khan, Samee U.
Kolodziej, Joanna
Wang, Jun
PROCEEDINGS OF THE 2012 IEEE 18TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2012), 2012, : 164 - 171
[22] Data Analysis using Hadoop MapReduce Environment
Merla, PrathyushaRani
Liang, Yiheng
2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 4783 - 4785
[23] Genome Data Analysis using MapReduce Paradigm
Pahadia, Mayank
Srivastava, Akash
Srivastava, Divyang
Patil, Nagamma
2015 SECOND INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATION ENGINEERING ICACCE 2015, 2015, : 556 - 559
[24] MapReduce-Based Improved Random Forest Model for Massive Educational Data Processing and Classification
Wei Xu
Vinh Truong Hoang
Mobile Networks and Applications, 2021, 26 : 191 - 199
[25] MapReduce-Based Improved Random Forest Model for Massive Educational Data Processing and Classification
Xu, Wei
Hoang, Vinh Truong
MOBILE NETWORKS & APPLICATIONS, 2021, 26 (01): : 191 - 199
[26] Parallel similarity joins on massive high-dimensional data using MapReduce
Ma, Youzhong
Meng, Xiaofeng
Wang, Shaoya
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2016, 28 (01): : 166 - 183
[27] DPM: Data Partitioning Method for Pipelined MapReduce on GPU
Jo, Myung Hyun
Ro, Won Woo
18TH IEEE INTERNATIONAL SYMPOSIUM ON CONSUMER ELECTRONICS (ISCE 2014), 2014,
[28] Handling partitioning skew in MapReduce using LEEN
Ibrahim, Shadi
Jin, Hai
Lu, Lu
He, Bingsheng
Antoniu, Gabriel
Wu, Song
PEER-TO-PEER NETWORKING AND APPLICATIONS, 2013, 6 (04) : 409 - 424
[29] Handling partitioning skew in MapReduce using LEEN
Shadi Ibrahim
Hai Jin
Lu Lu
Bingsheng He
Gabriel Antoniu
Song Wu
Peer-to-Peer Networking and Applications, 2013, 6 : 409 - 424
[30] MapReduce based improved quick reduct algorithm with granular refinement using vertical partitioning scheme
Sowkuntla, Pandu
Prasad, P. S. V. S. Sai
KNOWLEDGE-BASED SYSTEMS, 2020, 189
