An improved partitioning mechanism for optimizing massive data analysis using MapReduce

Authors
Kenn Slagter
Ching-Hsien Hsu
Yeh-Ching Chung
Daqiang Zhang
Affiliations
[1] National Tsing Hua University, Department of Computer Science
[2] Chung Hua University, Department of Computer Science
[3] Tongji University, School of Software Engineering
Source
The Journal of Supercomputing | 2013 / Vol. 66
Keywords
TeraSort; MapReduce; Load balance; Partitioning; Sampling; Cloud computing; Hadoop;
DOI
Not available
Abstract
In the era of Big Data, huge amounts of structured and unstructured data are produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. For MapReduce to work, it has to divide the workload among the computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid the problems inherent in data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the partitioning mechanism analyzes the samples. This paper proposes an improved partitioning algorithm that improves load balancing and reduces memory consumption. This is done via an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.
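The general idea of sampling-based partitioning mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the algorithm proposed in the paper, nor TeraSort's actual implementation: it is a generic range partitioner that picks boundary keys from a sorted sample, and all function names (`choose_split_points`, `partition_for`, `load_per_partition`) are hypothetical names chosen for illustration.

```python
import bisect
import random
from collections import Counter


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys at evenly spaced
    positions of the sorted sample (an illustrative choice)."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the range partition that receives key.
    A key equal to a boundary goes to the partition on its right."""
    return bisect.bisect_right(split_points, key)


def load_per_partition(keys, split_points):
    """Count how many keys land in each partition."""
    counts = Counter(partition_for(k, split_points) for k in keys)
    return [counts.get(i, 0) for i in range(len(split_points) + 1)]


if __name__ == "__main__":
    random.seed(0)
    # Skewed synthetic keys: most values are small, a few are very large.
    keys = [int(random.paretovariate(1.5) * 100) for _ in range(10_000)]
    sample = random.sample(keys, 500)
    splits = choose_split_points(sample, 4)
    print("split points:", splits)
    print("keys per partition:", load_per_partition(keys, splits))
```

Because the boundaries follow the sampled key distribution rather than the raw key range, skewed data tends to spread more evenly across partitions; how evenly depends on the sample size and how representative the sample is, which is the dependency the abstract describes.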
Pages: 539-555
Page count: 16