DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

Cited by: 0
Authors
Shang, Pengju [1]
Xiao, Qiangju [1]
Wang, Jun [1]
Affiliation
[1] Univ Cent Florida, Orlando, FL 32816 USA
Source
2012 DIGEST ASIA-PACIFIC MAGNETIC RECORDING CONFERENCE (APMRC) | 2012
Funding
U.S. National Science Foundation
Keywords
MapReduce; Hadoop; Data-intensive; Data layout;
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Recent years have seen an increasing number of scientists employ data parallel computing frameworks such as MapReduce and Hadoop to run data intensive applications and conduct analysis. In these co-located compute and storage frameworks, a wise data placement scheme can significantly improve performance. Existing data parallel frameworks, e.g. Hadoop or Hadoop-based clouds, distribute the data using a random placement method for simplicity and load balance. However, we observe that many data intensive applications exhibit interest locality: they sweep only part of a big data set. Data are often accessed together because of their grouping semantics. Without taking data grouping into consideration, random placement does not perform well and falls far below the efficiency of an optimal data distribution. In this paper, we develop a new Data-gRouping-AWare (DRAW) data placement scheme to address this problem. DRAW dynamically scrutinizes data accesses recorded in system log files. It extracts optimal data groupings and re-organizes data layouts to achieve the maximum parallelism per group subject to load balance. By running two real-world MapReduce applications with different data placement schemes on a 40-node test bed, we conclude that DRAW increases the total number of local map tasks executed by up to 59.8%, reduces the completion latency of the map phase by up to 41.7%, and improves the overall performance by 36.4%, in comparison with Hadoop's default random placement.
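The grouping-aware placement idea in the abstract can be illustrated with a small sketch. This is not the DRAW algorithm from the paper, whose details are not given in this record; it is a simplified stand-in that assumes the access log is already parsed into one list of block IDs per task. The function names (`co_access_counts`, `group_blocks`, `place`), the `min_count` threshold, and the greedy least-loaded placement are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def co_access_counts(access_log):
    """Count how often each pair of blocks is read by the same task."""
    counts = defaultdict(int)
    for blocks in access_log:
        for a, b in combinations(sorted(set(blocks)), 2):
            counts[(a, b)] += 1
    return counts


def group_blocks(access_log, min_count=2):
    """Merge blocks co-accessed at least min_count times (union-find).

    min_count is an assumed threshold, not a value from the paper.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for blocks in access_log:
        for b in blocks:
            find(b)  # register every block, even ungrouped ones
    for (a, b), n in co_access_counts(access_log).items():
        if n >= min_count:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for b in sorted(parent):
        groups[find(b)].append(b)
    return sorted(groups.values())


def place(groups, num_nodes):
    """Spread each group's blocks over different nodes, least-loaded first.

    Blocks of one group land on distinct nodes while the group fits,
    so tasks reading that group can run in parallel on local data.
    """
    load = [0] * num_nodes
    placement = {}
    for group in sorted(groups, key=len, reverse=True):
        order = sorted(range(num_nodes), key=lambda n: (load[n], n))
        for i, block in enumerate(group):
            node = order[i % num_nodes]
            placement[block] = node
            load[node] += 1
    return placement, load
```

Calling `place(group_blocks(log), num_nodes)` on a parsed log returns a block-to-node map and the per-node block counts. A random placement, by contrast, may put several blocks of one group on the same node, which limits how many map tasks over that group can run locally at once.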
Pages: 8
Related papers
50 in total
  • [21] Dache: A Data Aware Caching for Big-Data Applications Using The MapReduce Framework
    Zhao, Yaxiong
    Wu, Jie
    2013 PROCEEDINGS IEEE INFOCOM, 2013, : 35 - 39
  • [22] Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework
    Yaxiong Zhao
    Jie Wu
    Cong Liu
    Tsinghua Science and Technology, 2014, 19 (01) : 39 - 50
  • [23] Dache: A data aware caching for big-data applications using the MapReduce framework
    Zhao, Y. (yaxiongzhao@google.com), 1600, Tsinghua University (19): : 39 - 50
  • [24] Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework
    Zhao, Yaxiong
    Wu, Jie
    Liu, Cong
    TSINGHUA SCIENCE AND TECHNOLOGY, 2014, 19 (01) : 39 - 50
  • [25] Heterogeneity-Aware Data Placement in Hybrid Clouds
    Marquez, Jack D.
    Gonzalez, Juan D.
    Mondragon, Oscar H.
    CLOUD COMPUTING - CLOUD 2019, 2019, 11513 : 177 - 191
  • [26] A Cost-Aware Resource Selection Approach for Data-intensive Applications in Grids
    Liu, Wei
    Shi, Feiyan
    Li, Hongfeng
    Xu, Zhihao
    2ND INTERNATIONAL SYMPOSIUM ON COMPUTER NETWORK AND MULTIMEDIA TECHNOLOGY (CNMT 2010), VOLS 1 AND 2, 2010, : 182 - 185
  • [27] Topology-Aware Resource Allocation for Data-Intensive Workloads
    Lee, Gunho
    Tolia, Niraj
    Ranganathan, Parthasarathy
    Katz, Randy H.
    ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2011, 41 (01) : 120 - 124
  • [28] Data intensive and network aware (DIANA) grid scheduling
    McClatchey R.
    Anjum A.
    Stockinger H.
    Ali A.
    Willers I.
    Thomas M.
    J. Grid Comput., 2007, 1 (43-64): : 43 - 64
  • [29] The Initial Data-Placement-Plan of Data-Insensitive Applications in Cloud
    Li Hong-jin
    2012 INTERNATIONAL CONFERENCE ON INTELLIGENCE SCIENCE AND INFORMATION ENGINEERING, 2012, 20 : 119 - 121
  • [30] GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers
    Convolbo, Moise W.
    Chou, Jerry
    Hsu, Ching-Hsien
    Chung, Yeh Ching
    COMPUTING, 2018, 100 (01) : 21 - 46