DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

Times cited: 0
Authors
Shang, Pengju [1]
Xiao, Qiangju [1]
Wang, Jun [1]
Affiliations
[1] Univ Cent Florida, Orlando, FL 32816 USA
Source
2012 DIGEST ASIA-PACIFIC MAGNETIC RECORDING CONFERENCE (APMRC) | 2012
Funding
US National Science Foundation
Keywords
MapReduce; Hadoop; Data-intensive; Data layout;
DOI
Not available
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Recent years have seen an increasing number of scientists employ data parallel computing frameworks such as MapReduce and Hadoop to run data intensive applications and conduct analysis. In these frameworks, where compute and storage are co-located, a well-chosen data placement scheme can significantly improve performance. Existing data parallel frameworks, e.g. Hadoop or Hadoop-based clouds, distribute data using a random placement method for simplicity and load balance. However, we observe that many data intensive applications exhibit interest locality: they sweep only part of a big data set. Data that are often accessed together are related by their grouping semantics. Because it does not take data grouping into consideration, random placement performs poorly and falls far below the efficiency of an optimal data distribution. In this paper, we develop a new Data-gRouping-AWare (DRAW) data placement scheme to address this problem. DRAW dynamically scrutinizes data accesses recorded in system log files. It extracts optimal data groupings and re-organizes data layouts to achieve the maximum parallelism per group, subject to load balance. By running two real-world MapReduce applications under different data placement schemes on a 40-node test bed, we conclude that DRAW increases the total number of local map tasks executed by up to 59.8%, reduces the completion latency of the map phase by up to 41.7%, and improves the overall performance by 36.4%, in comparison with Hadoop's default random placement.
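The idea summarized in the abstract (mine groupings from access logs, then spread each group across nodes while keeping load balanced) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the log format (one list of block IDs per task), the `min_count` threshold, the union-find grouping, and the greedy least-loaded placement are all assumptions made for illustration, and the function names `co_access_counts`, `group_blocks`, and `place` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_access_counts(task_logs):
    """Count how often each pair of blocks is read by the same task."""
    counts = defaultdict(int)
    for blocks in task_logs:
        for a, b in combinations(sorted(set(blocks)), 2):
            counts[(a, b)] += 1
    return counts


def group_blocks(task_logs, min_count=2):
    """Assumed grouping rule: blocks co-accessed at least min_count times share a group."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for blocks in task_logs:
        for b in blocks:
            find(b)
    for (a, b), count in co_access_counts(task_logs).items():
        if count >= min_count:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for b in parent:
        groups[find(b)].append(b)
    return sorted(sorted(g) for g in groups.values())


def place(groups, num_nodes):
    """Greedy placement: put blocks of one group on distinct nodes,
    always picking the least-loaded node that the group has not used yet."""
    load = [0] * num_nodes
    placement = {}
    for group in sorted(groups, key=len, reverse=True):
        used = set()
        for block in group:
            candidates = [n for n in range(num_nodes) if n not in used]
            node = min(candidates, key=lambda n: (load[n], n))
            placement[block] = node
            load[node] += 1
            used.add(node)
            if len(used) == num_nodes:
                used = set()  # group is larger than the cluster: start a new round
    return placement


# Hypothetical log: each inner list is the set of blocks one task read.
logs = [["b1", "b2", "b3"], ["b1", "b2", "b3"],
        ["b4", "b5"], ["b4", "b5"], ["b1", "b4"]]
groups = group_blocks(logs)
layout = place(groups, num_nodes=3)
```

With this toy log, blocks that are repeatedly read together land on different nodes, so a job that touches one group can run its map tasks in parallel on local data, while the per-node block counts stay within one of each other.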
Pages: 8
Related papers
50 in total
  • [31] Application and Storage-Aware Data Placement and Job Scheduling for Hadoop Clusters
    Li, Tao
    He, Shuibing
    Chen, Ping
    Yang, Siling
    Yin, Yanlong
    Xu, Cheng
    JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2020, 29 (16)
  • [32] RENDA: Resource and Network Aware Data Placement Algorithm for Periodic Workloads in Cloud
    Thakkar, Hiren Kumar
    Sahoo, Prasan Kumar
    Veeravalli, Bharadwaj
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (12) : 2906 - 2920
  • [33] DynDL: Scheduling Data-Locality-Aware Tasks with Dynamic Data Transfer Cost for Multicore-Server-Based Big Data Clusters
    Jin, Jiahui
    An, Qi
    Zhou, Wei
    Tang, Jiakai
    Xiong, Runqun
    APPLIED SCIENCES-BASEL, 2018, 8 (11)
  • [34] An Identification Algorithm in Grouping and Paralleling for Data-Intensive RFID Systems
    Duan Litian
    Zizhong, Wang John
    Fu, Duan
    BIG DATA COMPUTING AND COMMUNICATIONS, 2015, 9196 : 337 - 346
  • [35] Data-intensive applications, challenges, techniques and technologies: A survey on Big Data
    Chen, C. L. Philip
    Zhang, Chun-Yang
    INFORMATION SCIENCES, 2014, 275 : 314 - 347
  • [36] ExoApp: Performance Evaluation of Data-Intensive Applications on ExoGENI
    Yu, Ze
    Liu, Xinxin
    Li, Min
    Liu, Kaikai
    Li, Xiaolin
    2013 SECOND GENI RESEARCH AND EDUCATIONAL EXPERIMENT WORKSHOP (GREE), 2013 : 25 - 28
  • [37] A Survey of Semantics-Aware Performance Optimization for Data-Intensive Computing
    Rao, Bingbing
    Wang, Liqang
    2017 IEEE 15TH INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, 15TH INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, 3RD INTL CONF ON BIG DATA INTELLIGENCE AND COMPUTING AND CYBER SCIENCE AND TECHNOLOGY CONGRESS(DASC/PICOM/DATACOM/CYBERSCI, 2017, : 81 - 88
  • [38] An Efficiency-Aware Scheduling for Data-Intensive Computations on MapReduce Clusters
    Zhao, Hui
    Yang, Shuqiang
    Fan, Hua
    Chen, Zhikun
    Xu, Jinghu
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2013, E96D (12) : 2654 - 2662
  • [39] RING: NUMA-aware Message-batching Runtime for Data-intensive Applications
    Meng, Ke
    Tan, Guangming
    2017 IEEE 23RD INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2017 : 368 - 375
  • [40] MapReduce Across Distributed Clusters for Data-intensive Applications
    Wang, Lizhe
    Tao, Jie
    Marten, Holger
    Streit, Achim
    Khan, Samee U.
    Kolodziej, Joanna
    Chen, Dan
    2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS & PHD FORUM (IPDPSW), 2012 : 2004 - 2011