An improved partitioning mechanism for optimizing massive data analysis using MapReduce

Authors
Kenn Slagter
Ching-Hsien Hsu
Yeh-Ching Chung
Daqiang Zhang
Affiliations
[1] National Tsing Hua University, Department of Computer Science
[2] Chung Hua University, Department of Computer Science
[3] Tongji University, School of Software Engineering
Source
The Journal of Supercomputing | 2013 / Vol. 66
Keywords
TeraSort; MapReduce; Load balance; Partitioning; Sampling; Cloud computing; Hadoop
DOI
Not available
Abstract
In the era of Big Data, huge amounts of structured and unstructured data are produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. For MapReduce to work, it has to divide the workload among the computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid the problems caused by data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the partitioning mechanism analyzes the samples. This paper proposes an improved partitioning algorithm that improves load balancing and reduces memory consumption. This is done via an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.
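The general idea of sample-based partitioning mentioned in the abstract can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, and it is not TeraSort's implementation; it is a generic range partitioner written under simple assumptions (integer keys, a uniform random sample, cut keys taken at evenly spaced ranks of the sorted sample). The function names `choose_split_points`, `partition_of`, and `partition_sizes` are hypothetical and chosen only for illustration.

```python
import bisect
import random
from collections import Counter


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 cut keys at evenly spaced ranks of the sorted sample."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]


def partition_of(key, split_points):
    """Return the index of the key range that holds `key` (binary search over the cut keys)."""
    return bisect.bisect_right(split_points, key)


def partition_sizes(keys, split_points, num_partitions):
    """Count how many keys land in each partition, to inspect load balance."""
    counts = Counter(partition_of(k, split_points) for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative skewed data (an assumption): most keys are small, a few are large.
    keys = [int(rng.expovariate(1.0) * 1000) for _ in range(100_000)]
    # A larger, more representative sample tends to give more even partitions.
    sample = rng.sample(keys, 1_000)
    splits = choose_split_points(sample, 4)
    print(partition_sizes(keys, splits, 4))
```

Because the cut keys follow the sample's distribution rather than the raw key range, dense regions of the key space are split more finely, which is why the size and representativeness of the sample affect how evenly the work is spread.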
Pages: 539-555
Page count: 16